diff --git a/keps/prod-readiness/sig-node/4381.yaml b/keps/prod-readiness/sig-node/4381.yaml
index 93a3965f342..dcbe15c2fc8 100644
--- a/keps/prod-readiness/sig-node/4381.yaml
+++ b/keps/prod-readiness/sig-node/4381.yaml
@@ -4,3 +4,5 @@ kep-number: 4381
 alpha:
   approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"
diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md
index 70a4d3d57e9..7bb138ca522 100644
--- a/keps/sig-node/3063-dynamic-resource-allocation/README.md
+++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -952,33 +952,36 @@ when the feature is disabled.

 ### Rollout, Upgrade and Rollback Planning

-###### How can a rollout fail? Can it impact already running workloads?
+

-Workloads not using ResourceClaims should not be impacted because the new code
-will not do anything besides checking the Pod for ResourceClaims.
+###### How can a rollout fail? Can it impact already running workloads?

-When kube-controller-manager fails to create ResourceClaims from
-ResourceClaimTemplates, those Pods will not get scheduled. Bugs in
-kube-scheduler might lead to not scheduling Pods that could run or worse,
-schedule Pods that should not run. Those then will get stuck on a node where
-kubelet will refuse to start them. None of these scenarios affect already
-running workloads.
+

 ###### What specific metrics should inform a rollback?

-One indicator are unexpected restarts of the cluster control plane
-components. Another are an increase in the number of pods that fail to
-start. In both cases further analysis of logs and pod events is needed to
-determine whether errors are related to this feature.
+

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-This will be done manually before transition to beta by bringing up a KinD
-cluster with kubeadm and changing the feature gate for individual components.
+

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -988,57 +991,32 @@ No.

 ###### How can an operator determine if the feature is in use by workloads?

-There will be pods which have a non-empty PodSpec.ResourceClaims field and ResourceClaim objects.
-
-###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+There will be ResourceClaim objects with `spec.controllerName` set.

-For kube-controller-manager, metrics similar to the generic ephemeral volume
-controller [were added](https://github.com/kubernetes/kubernetes/blob/163553bbe0a6746e7719380e187085cf5441dfde/pkg/controller/resourceclaim/metrics/metrics.go#L32-L47):
+###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

-- [X] Metrics
-  - Metric name: `resource_controller_create_total`
-  - Metric name: `resource_controller_create_failures_total`
-  - Metric name: `workqueue` with `name="resource_claim"`
+

-For kube-scheduler and kubelet, existing metrics for handling Pods already
-cover most aspects. For example, in the scheduler the
-["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/6f5fa2eb2f4dc731243b00f7e781e95589b5621f/pkg/scheduler/metrics/metrics.go#L200-L206)
-metric will call out pods that are currently unschedulable because of the
-`DynamicResources` plugin.
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

 For the communication between scheduler and controller, the apiserver metrics
 about API calls (e.g. `request_total`, `request_duration_seconds`) for the
 `podschedulingcontexts` and `resourceclaims` resources provide insights into
 the amount of requests and how long they are taking.

-###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
-
-For Pods not using ResourceClaims, the same SLOs apply as before.
-
-For kube-controller-manager, metrics for the new controller could be checked to
-ensure that work items do not remain in the queue for too long, for some
-definition of "too long".
-
-Pod scheduling and startup are more important. However, expected performance
-will depend on how resources are used (for example, how often new Pods are
-created), therefore it is impossible to predict what reasonable SLOs might be.
-
-The resource manager component will do its work similarly to the
-existing volume manager, but the overhead and complexity should
-be lower:
-
-* Resource preparation should be fairly quick as in most cases it simply
-  creates CDI file 1-3 Kb in size. Unpreparing resource usually means
-  deleting CDI file, so it should be quick as well.
-
-* The complexity is lower than in the volume manager
-  because there is only one global operation needed (prepare vs.
-  attach + publish for each pod).
-
-* Reconstruction after a kubelet restart is simpler (call
-  NodePrepareResource again vs. trying to determine whether
-  volumes are mounted).
-
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

 No.
@@ -1055,31 +1033,27 @@ A third-party resource driver is required for allocating resources.

 ###### Will enabling / using this feature result in any new API calls?

-For Pods not using ResourceClaims, not much changes. kube-controller-manager,
-kube-scheduler and kubelet will have additional watches for ResourceClaim and
-ResourceClass, but if the feature isn't used, those watches
-will not cause much overhead.
-
-If the feature is used, ResourceClaim will be modified during Pod scheduling,
-startup and teardown by kube-scheduler, the third-party resource driver and
-kubelet. Once a ResourceClaim is allocated and the Pod runs, there will be no
-periodic API calls. How much this impacts performance of the apiserver
-therefore mostly depends on how often this feature is used for new
-ResourceClaims and Pods. Because it is meant for long-running applications, the
-impact should not be too high.
+

 ###### Will enabling / using this feature result in introducing new API types?

-For ResourceClass, only a few (something like 10 to 20)
-objects per cluster are expected. Admins need to create those.
-
-The number of ResourceClaim objects depends on how much the feature is
-used. They are namespaced and get created directly or indirectly by users. In
-the most extreme case, there will be one or more ResourceClaim for each Pod.
-But that seems unlikely for the intended use cases.
-
-Kubernetes itself will not impose specific limitations for the number of these
-objects.
+

 ###### Will enabling / using this feature result in any new calls to the cloud provider?

@@ -1087,61 +1061,75 @@ Only if the third-party resource driver uses features of the cloud provider.

 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?

-The PodSpec potentially changes and thus all objects where it is embedded as
-template. Merely enabling the feature does not change the size, only using it
-does.
-
-In the simple case, a Pod references existing ResourceClaims by name, which
-will add some short strings to the PodSpec and to the ContainerSpec. Embedding
-a ResourceClaimTemplate will increase the size more, but that will depend on
-the number of custom parameters supported by a resource driver and thus is hard to
-predict.
-
-The ResourceClaim objects will initially be fairly small. However, if delayed
-allocation is used, then the list of node names or NodeSelector instances
-inside it might become rather large and in the worst case will scale with the
-number of nodes in the cluster.
+

 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

-Startup latency of schedulable stateless pods may be affected by enabling the
-feature because some CPU cycles are needed for each Pod to determine whether it
-uses ResourceClaims.
+

 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

-Merely enabling the feature is not expected to increase resource usage much.
+

-How much using it will increase resource usage depends on the usage patterns
-and is hard to predict.
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+

 ### Troubleshooting

+
+
 ###### How does this feature react if the API server and/or etcd is unavailable?

 The Kubernetes control plane will be down, so no new Pods get
-scheduled. kubelet may still be able to start or or restart containers if it
-already received all the relevant updates (Pod, ResourceClaim, etc.).
+scheduled.

 ###### What are other known failure modes?

-- DRA driver does not or cannot allocate a resource claim.
+- DRA driver controller does not or cannot allocate a resource claim.

   - Detection: The primary mechanism is through vendors-provided monitoring
     for their driver. That monitor needs to include health of the driver,
     availability of the underlying resource, etc. The common helper code for
     DRA drivers posts events for a ResourceClaim when an allocation attempt
     fails.

-    When pods fail to get scheduled, kube-scheduler reports that through events
-    and pod status. For DRA, that includes "waiting for resource driver to
-    provide information" (node not selected yet) and "waiting for resource
-    driver to allocate resource" (node has been selected). The
-    ["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/9fca4ec44afad4775c877971036b436eef1a1759/pkg/scheduler/metrics/metrics.go#L200-L206)
-    metric will have pods counted under the "dynamicresources" plugin label.
-
     To troubleshoot, "kubectl describe" can be used on (in this order) Pod,
     ResourceClaim, PodSchedulingContext.
@@ -1157,50 +1145,6 @@ already received all the relevant updates (Pod, ResourceClaim, etc.).
     resources in one driver and then failing to allocate the remaining
     resources in another driver (the "need to deallocate" fallback).

-- A Pod gets scheduled without allocating resources.
-
-  - Detection: The Pod either fails to start (when kubelet has DRA
-    enabled) or gets started without the resources (when kubelet doesn't
-    have DRA enabled), which then will fail in an application specific
-    way.
-
-  - Mitigations: DRA must get enabled properly in kubelet and kube-controller-manager.
-    Then kube-controller-manager will try to allocate and reserve resources for
-    already scheduled pods. To prevent this from happening for new pods, DRA
-    must get enabled in kube-scheduler.
-
-  - Diagnostics: kubelet will log pods without allocated resources as errors
-    and emit events for them.
-
-  - Testing: An E2E test covers the expected behavior of kubelet and
-    kube-controller-manager by creating a pod with `spec.nodeName` already set.
-
-- A DRA driver kubelet plugin fails to prepare resources.
-
-  - Detection: The Pod fails to start after being scheduled.
-
-  - Mitigations: This depends on the specific DRA driver and has to be documented
-    by vendors.
-
-  - Diagnostics: kubelet will log pods with such errors and emit events for them.
-
-  - Testing: An E2E test covers the expected retry mechanism in kubelet when
-    `NodePrepareResources` fails intermittently.
-
-
-
-

 ###### What steps should be taken if SLOs are not being met to determine the problem?

 Performance depends on a large extend on how individual DRA drivers are
diff --git a/keps/sig-node/4381-dra-structured-parameters/README.md b/keps/sig-node/4381-dra-structured-parameters/README.md
index 309fe6c6311..454a5fa3576 100644
--- a/keps/sig-node/4381-dra-structured-parameters/README.md
+++ b/keps/sig-node/4381-dra-structured-parameters/README.md
@@ -400,7 +400,7 @@ root privileges that does some cluster-specific initialization of a device
 each time it is prepared on a node:

 ```yaml
-apiVersion: resource.k8s.io/v1alpha3
+apiVersion: resource.k8s.io/v1beta1
 kind: DeviceClass
 metadata:
   name: acme-gpu
@@ -440,7 +440,7 @@ For a simple trial, I create a Pod directly where two containers share the same
 of the GPU:

 ```yaml
-apiVersion: resource.k8s.io/v1alpha2
+apiVersion: resource.k8s.io/v1beta1
 kind: ResourceClaimTemplate
 metadata:
   name: device-consumer-gpu-template
@@ -542,7 +542,7 @@ Embedded inside each `ResourceSlice` is a list of one or more devices, each of w

 ```yaml
 kind: ResourceSlice
-apiVersion: resource.k8s.io/v1alpha3
+apiVersion: resource.k8s.io/v1beta1
 ...
 spec:
   # The node name indicates the node.
@@ -940,13 +940,27 @@ set when it is enabled. Initially, they are declared as alpha. Even though they
 are alpha, changes to their schema are discouraged and would have to be done by
 using new field names.

-ResourceClaim, DeviceClass and ResourceClaimTemplate are new built-in types
-in `resource.k8s.io/v1alpha3`. This alpha group must be explicitly enabled in
+After promotion to beta they are still disabled by default unless the feature
+gate explicitly gets enabled. The feature gate remains off by default because
+DRA depends on a new API group which following the
+[convention](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/3136-beta-apis-off-by-default)
+is off by default.
+
+ResourceClaim, DeviceClass and ResourceClaimTemplate are built-in types
+in `resource.k8s.io/v1beta1`. This beta group must be explicitly enabled in
 the apiserver's runtime configuration. Using builtin types was chosen instead
 of using CRDs because core Kubernetes components must interact with the new
 objects and installation of CRDs as part of cluster creation is an unsolved
 problem.

+The storage version of this API group is `v1beta1`. This enables a potential
+future removal of the `v1alpha3` version. `v1alpha3` is still supported for
+clients via conversion. This enables version skew testing (kubelet from 1.31
+with 1.32 control plane, incremental update) and makes DRA drivers written for
+1.31 immediately usable with 1.32. Cluster upgrades from 1.31 are supported,
+downgrades only if DRA is not enabled in the downgraded cluster or no resources
+exist in the cluster which use the `v1beta1` format.
+
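
The paragraphs above require flipping both the DRA feature gate and the `resource.k8s.io/v1beta1` API group before anything in this KEP can be used. As a non-authoritative sketch of how a kind-based test cluster (like the one used for the manual upgrade testing mentioned later) might enable both — the node layout and file are assumptions, only the feature gate and API group names come from this KEP:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  # Beta, but off by default because the API group is off by default.
  DynamicResourceAllocation: true
runtimeConfig:
  # Enable the beta API group explicitly in the apiserver.
  "resource.k8s.io/v1beta1": "true"
nodes:
- role: control-plane
- role: worker
```
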
 Secrets are not part of this API: if a DRA driver needs secrets, for example
 to access its own backplane, then it can define custom parameters for those
 secrets and retrieve them directly from the apiserver. This works because
@@ -1159,6 +1173,10 @@ type BasicDevice struct {
     // Attributes defines the set of attributes for this device.
     // The name of each attribute must be unique in that set.
     //
+    // To ensure this uniqueness, attributes defined by the vendor
+    // must be listed without the driver name as domain prefix in
+    // their name. All others must be listed with their domain prefix.
+    //
     // The maximum number of attributes and capacities combined is 32.
     //
     // +optional
@@ -1167,13 +1185,18 @@ type BasicDevice struct {
     // Capacity defines the set of capacities for this device.
     // The name of each capacity must be unique in that set.
     //
+    // To ensure this uniqueness, capacities defined by the vendor
+    // must be listed without the driver name as domain prefix in
+    // their name. All others must be listed with their domain prefix.
+    //
     // The maximum number of attributes and capacities combined is 32.
     //
     // +optional
-    Capacity map[QualifiedName]resource.Quantity
+    Capacity map[QualifiedName]DeviceCapacity
 }

-// Limit for the sum of the number of entries in both ResourceSlices.
+// Limit for the sum of the number of entries in both ResourceSlices.Attributes
+// and ResourceSlices.Capacity.
 const ResourceSliceMaxAttributesAndCapacitiesPerDevice = 32

 // QualifiedName is the name of a device attribute or capacity.
@@ -1189,6 +1212,7 @@ const ResourceSliceMaxAttributesAndCapacitiesPerDevice = 32
 // domain prefix are assumed to be part of the driver's domain. Attributes
 // or capacities defined by 3rd parties must include the domain prefix.
 //
+//
 // The maximum length for the DNS subdomain is 63 characters (same as
 // for driver names) and the maximum length of the C identifier
 // is 32.
@@ -1234,8 +1258,22 @@ type DeviceAttribute struct {
 // DeviceAttributeMaxValueLength is the maximum length of a string or version attribute value.
 const DeviceAttributeMaxValueLength = 64
+
+// DeviceCapacity is a single entry in [BasicDevice.Capacity].
+type DeviceCapacity struct {
+    // Quantity defines how much of a certain device capacity is available.
+    Quantity resource.Quantity
+
+    // potential future addition: fields which define how to "consume"
+    // capacity (= share a single device between different consumers).
 ```
+The `v1alpha3` API directly mapped to a `resource.Quantity` instead of this
+`DeviceCapacity`. Semantically the two are currently equivalent, therefore
+custom conversion code makes it possible to continue supporting `v1alpha3`. At
+the time that "consumable capacity" gets added (if it gets added!) the alpha
+API probably can be removed because all clients will use the beta API.
+
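
To make the naming rules and the new `DeviceCapacity` type above concrete, here is a hedged sketch of how a published device might look on the wire; the driver name, node name, attribute and capacity entries are invented for illustration, and the exact serialized field names follow the `v1beta1` schema rather than this excerpt:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: worker-1-gpu.example.com   # hypothetical; normally generated by the driver
spec:
  driver: gpu.example.com          # hypothetical driver name
  nodeName: worker-1
  pool:
    name: worker-1
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        # Vendor attribute: no domain prefix, implicitly scoped to gpu.example.com.
        model:
          string: "A100"
        # Attribute defined by a third party: must carry its domain prefix.
        example.org/cooling:
          string: "passive"
      capacity:
        # One DeviceCapacity entry; the field name of the quantity is an assumption.
        memory:
          value: 80Gi
```
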

 ###### ResourceClaim

@@ -1777,6 +1815,17 @@ type DeviceRequestAllocationResult struct {
     //
     // +required
     Device string
+
+    // AdminAccess is a copy of the AdminAccess value in the
+    // request which caused this device to be allocated.
+    //
+    // New allocations are required to have this set. Old allocations made
+    // by Kubernetes 1.31 do not have it yet. Clients which want to
+    // support Kubernetes 1.31 need to look up the request and retrieve
+    // the value from there if this field is not set.
+    //
+    // +required
+    AdminAccess *bool
 }

 // DeviceAllocationConfiguration gets embedded in an AllocationResult.
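
For context on the `AdminAccess` result field above: it mirrors the `adminAccess` flag that a privileged claim can set on a request. A hedged sketch of such a claim, assuming the hypothetical names below (whether admin access is actually permitted may depend on additional gates not covered in this excerpt):

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-monitor        # hypothetical
  namespace: monitoring    # hypothetical
spec:
  devices:
    requests:
    - name: monitor
      deviceClassName: acme-gpu   # DeviceClass from the example earlier in this KEP
      # Requests admin access; per the comment above, new allocations copy this
      # value into status.allocation.devices.results[].adminAccess.
      adminAccess: true
```
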
@@ -2166,6 +2215,10 @@ k8s.io/kubelet/pkg/apis/dra
 gRPC interface. It was inspired by
 [CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
 with “volume” replaced by “resource” and volume specific parts removed.

+Versions v1alpha4 and v1beta1 are supported by kubelet. Both are identical.
+DRA drivers should implement both because support for v1alpha4 might get
+removed.
+
 #### Version skew

 Previously, kubelet retrieved ResourceClaims and published ResourceSlices on
@@ -2252,7 +2305,7 @@ MAY choose to call `NodePrepareResource` again, or choose to call

 On a successful call this RPC should return set of fully qualified
 CDI device names, which kubelet MUST pass to the runtime through the CRI
-protocol. For version v1alpha3, the RPC should return multiple sets of
+protocol. As of v1alpha3, the RPC should return multiple sets of
 fully qualified CDI device names, one per claim that was sent in the input
 parameters.

 ```protobuf
@@ -2518,6 +2571,20 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
 https://storage.googleapis.com/k8s-triage/index.html
 -->

+The scheduler plugin and resource claim controller are covered by the workloads
+in
+https://github.com/kubernetes/kubernetes/blob/master/test/integration/scheduler_perf/config/performance-config.yaml
+with "Structured" in the name. Those tests run in:
+
+- [pre-submit](https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-integration) and [periodic](https://testgrid.k8s.io/sig-release-master-blocking#integration-master) integration testing under `k8s.io/kubernetes/test/integration/scheduler_perf.scheduler_perf`
+- [periodic performance testing](https://testgrid.k8s.io/sig-scalability-benchmarks#scheduler-perf) which populates [http://perf-dash.k8s.io](http://perf-dash.k8s.io/#/?jobname=scheduler-perf-benchmark&metriccategoryname=Scheduler&metricname=BenchmarkPerfResults&Metric=SchedulingThroughput&Name=SchedulingBasic%2F5000Nodes_10000Pods%2Fnamespace-2&extension_point=not%20applicable&plugin=not%20applicable&result=not%20applicable&event=not%20applicable)
+
+**TODO**: link to "Structured" once https://github.com/kubernetes/kubernetes/pull/127277 is merged.
+
+The periodic performance testing prevents performance regressions by tracking
+performance over time and by failing the test if performance drops below a
+threshold defined for each workload.
+
 ##### e2e tests

-
 ###### How can a rollout or rollback fail? Can it impact already running workloads?

-
+When kube-controller-manager fails to create ResourceClaims from
+ResourceClaimTemplates, those Pods will not get scheduled. Bugs in
+kube-scheduler might lead to not scheduling Pods that could run or worse,
+schedule Pods that should not run. Those then will get stuck on a node where
+kubelet will refuse to start them. None of these scenarios affect already
+running workloads.
+
+Failures in kubelet might affect running workloads, but only if containers for
+those workloads need to be restarted.

 ###### What specific metrics should inform a rollback?

-
+One indicator is unexpected restarts of the cluster control plane
+components. Another is an increase in the number of pods that fail to
+start. In both cases further analysis of logs and pod events is needed to
+determine whether errors are related to this feature.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-
+This will be done manually before transition to beta by bringing up a KinD
+cluster with kubeadm and changing the feature gate for individual components.

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

-
+No.

 ### Monitoring Requirements

@@ -2689,90 +2747,74 @@ previous answers based on experience in the field.

 ###### How can an operator determine if the feature is in use by workloads?

-
+There will be pods which have a non-empty PodSpec.ResourceClaims field and ResourceClaim objects.

-Metrics in kube-scheduler (names to be decided):
-- number of classes using structured parameters
+Metrics in kube-controller-manager (names to be decided):
+- number of claims using structured parameters
 - number of claims which currently are allocated with structured parameters

 ###### How can someone using this feature know that it is working for their instance?

-

 - [X] API .status
   - Other field: ".status.allocation" will be set for a claim using structured
     parameters when needed by a pod.
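
As a hedged sketch of the `.status.allocation` signal mentioned above, an allocated claim generated from the template used earlier in this KEP could look roughly like this; the generated name suffix, driver, pool and device values are placeholders:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: device-consumer-gpu-xxxxx   # generated from device-consumer-gpu-template; suffix elided
  namespace: default
spec:
  devices:
    requests:
    - name: gpu                     # hypothetical request name
      deviceClassName: acme-gpu
status:
  allocation:                       # present once the claim has been allocated
    devices:
      results:
      - request: gpu
        driver: gpu.example.com     # hypothetical driver name
        pool: worker-1
        device: gpu-0
  reservedFor:                      # the pod(s) currently entitled to use the claim
  - resource: pods
    name: device-consumer
    uid: "00000000-0000-0000-0000-000000000000"   # placeholder
```
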

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

-
+For Pods not using ResourceClaims, the same SLOs apply as before.
+
+For kube-controller-manager, metrics for the new controller could be checked to
+ensure that work items do not remain in the queue for too long, for some
+definition of "too long".
+
+Pod scheduling and startup are more important. However, expected performance
+will depend on how resources are used (for example, how often new Pods are
+created), therefore it is impossible to predict what reasonable SLOs might be.
+
+The resource manager component will do its work similarly to the
+existing volume manager, but the overhead and complexity should
+be lower:
+
+* Resource preparation should be fairly quick as in most cases it simply
+  creates a CDI file of 1-3 KiB in size. Unpreparing a resource usually means
+  deleting the CDI file, so it should be quick as well.
+
+* The complexity is lower than in the volume manager
+  because there is only one global operation needed (prepare vs.
+  attach + publish for each pod).
+
+* Reconstruction after a kubelet restart is simpler (call
+  NodePrepareResource again vs. trying to determine whether
+  volumes are mounted).

 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

-
+For kube-controller-manager, metrics similar to the generic ephemeral volume
+controller [were added](https://github.com/kubernetes/kubernetes/blob/163553bbe0a6746e7719380e187085cf5441dfde/pkg/controller/resourceclaim/metrics/metrics.go#L32-L47):
+
+- [X] Metrics
+  - Metric name: `resource_controller_create_total`
+  - Metric name: `resource_controller_create_failures_total`
+  - Metric name: `workqueue` with `name="resource_claim"`

-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+For kube-scheduler and kubelet, existing metrics for handling Pods already
+cover most aspects. For example, in the scheduler the
+["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/6f5fa2eb2f4dc731243b00f7e781e95589b5621f/pkg/scheduler/metrics/metrics.go#L200-L206)
+metric will call out pods that are currently unschedulable because of the
+`DynamicResources` plugin.

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-
+No.

 ### Dependencies

-
+The container runtime must support CDI.

 ###### Does this feature depend on any specific services running in the cluster?

-
+A third-party DRA driver is required for publishing resource information and
+preparing resources on a node.

 ### Scalability

@@ -2788,79 +2830,76 @@ previous answers based on experience in the field.

 ###### Will enabling / using this feature result in any new API calls?

-
+For Pods not using ResourceClaims, not much changes. The following components
+need to watch additional resources:
+- kube-controller-manager: ResourceClaimTemplate
+- kube-scheduler: ResourceClaim, DeviceClass, ResourceSlice
+- kubelet: ResourceClaim
+
+If the feature isn't used, those watches will not cause much overhead.
+
+If the feature is used, kube-scheduler needs to update ResourceClaim during Pod
+scheduling. kube-controller-manager creates it (optional, only
+when using a ResourceClaimTemplate) during Pod creation and updates and
+optionally deletes it after Pod termination.
+
+Once a ResourceClaim is allocated and the Pod runs, there will be no periodic
+API calls. How much this impacts performance of the apiserver therefore mostly
+depends on how often this feature is used for new ResourceClaims and
+Pods. Because it is meant for long-running applications, the impact should not
+be too high.

 ###### Will enabling / using this feature result in introducing new API types?

-
+For DeviceClass, only a few (something like 10 to 20)
+objects per cluster are expected. Admins need to create those.
+
+The number of ResourceClaim objects depends on how much the feature is
+used. They are namespaced and get created directly or indirectly by users. In
+the most extreme case, there will be one or more ResourceClaim for each Pod.
+But that seems unlikely for the intended use cases.
+
+How many ResourceSlice objects get published depends on third-party drivers and
+how much hardware is installed in the cluster. Typically, each driver will
+publish one ResourceSlice per node where it manages hardware.
+
+Kubernetes itself will not impose specific limitations for the number of these
+objects.

 ###### Will enabling / using this feature result in any new calls to the cloud provider?

-
+Only if the third-party resource driver uses features of the cloud provider.
+
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?

-
+The PodSpec potentially changes and thus all objects where it is embedded as
+template. Merely enabling the feature does not change the size, only using it
+does.

-###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+In the simple case, a Pod references an existing ResourceClaim or
+ResourceClaimTemplate by name, which will add some short strings to the PodSpec
+and to the ContainerSpec.

-
+Actively using the feature will increase load on the apiserver, so latency of
+API calls may get affected.

 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

-
+How much using it will increase resource usage depends on the usage patterns
+and is hard to predict.

 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

-
+The kubelet needs a gRPC connection to each DRA driver running on the node.
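
The scalability answers above expect only a handful of admin-created DeviceClass objects per cluster. A minimal sketch of one, assuming a hypothetical driver name; the CEL selector shape follows the v1beta1 API rather than this excerpt:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: example-gpu                 # hypothetical; admins typically create 10-20 of these
spec:
  selectors:
  - cel:
      # Match every device published by this (hypothetical) driver.
      expression: device.driver == "gpu.example.com"
```
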

 ### Troubleshooting

@@ -2877,20 +2916,70 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+The Kubernetes control plane will be down, so no new Pods get
+scheduled. kubelet may still be able to start or restart containers if it
+already received all the relevant updates (Pod, ResourceClaim, etc.).
+
+

 ###### What are other known failure modes?

-
+- kube-scheduler cannot allocate ResourceClaims.
+
+  - Detection: When pods fail to get scheduled, kube-scheduler reports that
+    through events and pod status. For DRA, messages include "cannot allocate
+    all claims" (insufficient resources) and "ResourceClaim not created yet"
+    (user or kube-controller-manager haven't created the ResourceClaim yet).
+    The
+    ["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/9fca4ec44afad4775c877971036b436eef1a1759/pkg/scheduler/metrics/metrics.go#L200-L206)
+    metric will have pods counted under the "dynamicresources" plugin label.
+
+    To troubleshoot, "kubectl describe" can be used on (in this order) Pod
+    and ResourceClaim.
+
+  - Mitigations: When resources should be available but don't get
+    advertised in ResourceSlices, debugging must focus on the DRA driver,
+    with troubleshooting instructions provided by the vendor.
+
+    When ResourceClaims for ResourceClaimTemplates don't get created, the log
+    output of the kube-controller-manager will have more information.
+
+  - Diagnostics: In kube-scheduler, -v=4 enables simple progress reporting
+    in the "dynamicresources" plugin. -v=5 provides more information about
+    each plugin method. The special status results mentioned above also get
+    logged.
+
+  - Testing: E2E testing covers various scenarios that involve waiting
+    for a DRA driver. This also simulates partial allocation of node-local
+    resources in one driver and then failing to allocate the remaining
+    resources in another driver (the "need to deallocate" fallback).
+
+- A Pod gets scheduled without allocating resources.
+
+  - Detection: The Pod either fails to start (when kubelet has DRA
+    enabled) or gets started without the resources (when kubelet doesn't
+    have DRA enabled), which then will fail in an application specific
+    way.
+
+  - Mitigations: DRA must get enabled properly in kubelet and kube-controller-manager.
+
+  - Diagnostics: kubelet will log pods without allocated resources as errors
+    and emit events for them.
+
+  - Testing: An E2E test covers the expected behavior of kubelet and
+    kube-controller-manager by creating a pod with `spec.nodeName` already set.
+
+- A DRA driver kubelet plugin fails to prepare resources.
+
+  - Detection: The Pod fails to start after being scheduled.
+
+  - Mitigations: This depends on the specific DRA driver and has to be documented
+    by vendors.
+
+  - Diagnostics: kubelet will log pods with such errors and emit events for them.
+
+  - Testing: An E2E test covers the expected retry mechanism in kubelet when
+    `NodePrepareResources` fails intermittently.
+
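
To complement the first failure mode above, a hedged sketch of how it typically surfaces to users; the node count and exact message wording come from kube-scheduler at runtime and will differ, only the "cannot allocate all claims" text is taken from this KEP:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: device-consumer
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
    # Illustrative only; the trailing text is the status reported by the
    # "dynamicresources" scheduler plugin as described above.
    message: '0/3 nodes are available: cannot allocate all claims.'
```
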

 ###### What steps should be taken if SLOs are not being met to determine the problem?

diff --git a/keps/sig-node/4381-dra-structured-parameters/kep.yaml b/keps/sig-node/4381-dra-structured-parameters/kep.yaml
index 08b6dcf1c4b..d030713d6ce 100644
--- a/keps/sig-node/4381-dra-structured-parameters/kep.yaml
+++ b/keps/sig-node/4381-dra-structured-parameters/kep.yaml
@@ -20,16 +20,17 @@ see-also:
   - "/keps/sig-node/3063-dynamic-resource-allocation"

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.31"
+latest-milestone: "v1.32"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.30"
+  beta: "v1.32"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
@@ -44,4 +45,6 @@ disable-supported: true

 # The following PRR answers are required at beta release
 metrics:
-  # - my_feature_metric
+  - resource_controller_create_total
+  - resource_controller_create_failures_total
+  - resource controller workqueue with name="resource_claim"