DRA: PRR and API for beta of structured parameters in 1.32
Much of the PRR text that was originally written for "classic DRA" applies also
to "structured parameters". It gets moved from kubernetes#3063 to kubernetes#4381, with some minor
adaptations. The placeholder comments get restored in kubernetes#3063 because further work
on the KEP would be needed to move it forward - if it gets moved forward at all
instead of being abandoned.

The v1beta1 API will be almost identical to the v1alpha3 API, with just some
minor tweaks to fix oversights.

The kubelet gRPC API gets bumped with no changes. Nonetheless, drivers should get
updated, which can be done by updating the Go dependencies and optionally
changing the API import.
pohly committed Sep 24, 2024
1 parent a5ecee1 commit 3b937b9
Showing 3 changed files with 360 additions and 324 deletions.
260 changes: 102 additions & 158 deletions keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -952,33 +952,36 @@ when the feature is disabled.

### Rollout, Upgrade and Rollback Planning

<!--
This section must be completed when targeting beta to a release.
-->

###### How can a rollout fail? Can it impact already running workloads?

<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

Workloads not using ResourceClaims should not be impacted because the new code
will not do anything besides checking the Pod for ResourceClaims.

When kube-controller-manager fails to create ResourceClaims from
ResourceClaimTemplates, those Pods will not get scheduled. Bugs in
kube-scheduler might lead to not scheduling Pods that could run or, worse, to
scheduling Pods that should not run. Those then get stuck on a node where
kubelet will refuse to start them. None of these scenarios affect already
running workloads.

Failures in kubelet might affect running workloads, but only if containers for
those workloads need to be restarted.
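
As an illustrative sketch (not part of the KEP itself), these symptoms can be
checked with standard kubectl queries; `my-claim` and `my-namespace` are
placeholders, and the ResourceClaim API is only served while the feature is
enabled:

```sh
# Pods stuck in Pending, e.g. because a ResourceClaim could not be created or allocated.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Scheduling failures are reported as events on the affected Pods.
kubectl get events --all-namespaces --field-selector=reason=FailedScheduling

# Events posted for a specific ResourceClaim, e.g. failed allocation attempts.
kubectl describe resourceclaim my-claim --namespace my-namespace
```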

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

One indicator is unexpected restarts of the cluster control plane
components. Another is an increase in the number of pods that fail to
start. In both cases, further analysis of logs and pod events is needed to
determine whether errors are related to this feature.
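
For illustration only, assuming a kubeadm-style cluster where the control
plane runs as static pods labeled `tier=control-plane` (other distributions
may use different labels), these indicators can be spot-checked with:

```sh
# Unexpected restarts of control plane components show up in the RESTARTS column.
kubectl get pods --namespace kube-system --selector tier=control-plane

# Rough count of pods that are neither running nor completed.
kubectl get pods --all-namespaces --no-headers | grep -cvE 'Running|Completed|Succeeded'
```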

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

This will be done manually before transition to beta by bringing up a KinD
cluster with kubeadm and changing the feature gate for individual components.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -988,57 +991,32 @@ No.

###### How can an operator determine if the feature is in use by workloads?

There will be pods which have a non-empty PodSpec.ResourceClaims field, and
there will be ResourceClaim objects with `spec.controllerName` set.
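
A possible way to check this, assuming the cluster serves the ResourceClaim
API (exact field names depend on the API version in use):

```sh
# Pods together with the names of the ResourceClaims they reference;
# Pods that do not use the feature show <none>.
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLAIMS:.spec.resourceClaims[*].name'

# ResourceClaim objects created directly or from ResourceClaimTemplates.
kubectl get resourceclaims --all-namespaces
```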

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

For kube-controller-manager, metrics similar to the generic ephemeral volume
controller [were added](https://github.com/kubernetes/kubernetes/blob/163553bbe0a6746e7719380e187085cf5441dfde/pkg/controller/resourceclaim/metrics/metrics.go#L32-L47):

- [X] Metrics
  - Metric name: `resource_controller_create_total`
  - Metric name: `resource_controller_create_failures_total`
  - Metric name: `workqueue` with `name="resource_claim"`

For kube-scheduler and kubelet, existing metrics for handling Pods already
cover most aspects. For example, in the scheduler the
["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/6f5fa2eb2f4dc731243b00f7e781e95589b5621f/pkg/scheduler/metrics/metrics.go#L200-L206)
metric will call out pods that are currently unschedulable because of the
`DynamicResources` plugin.

For the communication between scheduler and controller, the apiserver metrics
about API calls (e.g. `request_total`, `request_duration_seconds`) for the
`podschedulingcontexts` and `resourceclaims` resources provide insights into
the amount of requests and how long they are taking.
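
For the apiserver part of these SLIs, a hedged example of how an operator
might pull the relevant series (kube-controller-manager and kube-scheduler
expose their own `/metrics` endpoints, not shown here):

```sh
# Request counts and latency histograms for the DRA resources, as recorded by
# the kube-apiserver; requires permission to read the /metrics endpoint.
kubectl get --raw /metrics \
  | grep -E 'apiserver_request_(total|duration_seconds).*resource="(resourceclaims|podschedulingcontexts)"'
```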

###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

<!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.

It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
  - per-day percentage of API calls finishing with 5XX errors <= 1%
  - 99% percentile over day of absolute value from (job creation time minus expected
    job creation time) for cron job <= 10%
  - 99.9% of /health requests per day finish with 200 code

These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

For Pods not using ResourceClaims, the same SLOs apply as before.

For kube-controller-manager, metrics for the new controller could be checked to
ensure that work items do not remain in the queue for too long, for some
definition of "too long".

Pod scheduling and startup are more important. However, expected performance
will depend on how resources are used (for example, how often new Pods are
created), therefore it is impossible to predict what reasonable SLOs might be.

The resource manager component will do its work similarly to the
existing volume manager, but the overhead and complexity should
be lower:

* Resource preparation should be fairly quick, as in most cases it simply
  creates a CDI file 1-3 KiB in size. Unpreparing a resource usually means
  deleting the CDI file, so it should be quick as well.

* The complexity is lower than in the volume manager
because there is only one global operation needed (prepare vs.
attach + publish for each pod).

* Reconstruction after a kubelet restart is simpler (call
NodePrepareResource again vs. trying to determine whether
volumes are mounted).

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

No.
@@ -1055,93 +1033,103 @@ A third-party resource driver is required for allocating resources.

###### Will enabling / using this feature result in any new API calls?

<!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)

Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
  (e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
  heartbeats, leader election, etc.)
-->

For Pods not using ResourceClaims, not much changes. kube-controller-manager,
kube-scheduler and kubelet will have additional watches for ResourceClaim and
ResourceClass, but if the feature isn't used, those watches
will not cause much overhead.

If the feature is used, ResourceClaim will be modified during Pod scheduling,
startup and teardown by kube-scheduler, the third-party resource driver and
kubelet. Once a ResourceClaim is allocated and the Pod runs, there will be no
periodic API calls. How much this impacts performance of the apiserver
therefore mostly depends on how often this feature is used for new
ResourceClaims and Pods. Because it is meant for long-running applications, the
impact should not be too high.

###### Will enabling / using this feature result in introducing new API types?

<!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->

For ResourceClass, only a few (something like 10 to 20)
objects per cluster are expected. Admins need to create those.

The number of ResourceClaim objects depends on how much the feature is
used. They are namespaced and get created directly or indirectly by users. In
the most extreme case, there will be one or more ResourceClaims for each Pod.
But that seems unlikely for the intended use cases.

Kubernetes itself will not impose specific limitations for the number of these
objects.
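
A sketch of how an operator could confirm which of these types are served and
how many objects exist, assuming the resource.k8s.io API group used by this
KEP is enabled:

```sh
# API types introduced for dynamic resource allocation.
kubectl api-resources --api-group=resource.k8s.io

# Number of ResourceClaim objects across all namespaces.
kubectl get resourceclaims --all-namespaces --no-headers | wc -l
```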

###### Will enabling / using this feature result in any new calls to the cloud provider?

Only if the third-party resource driver uses features of the cloud provider.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

<!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->

The PodSpec potentially changes and thus all objects where it is embedded as
a template. Merely enabling the feature does not change the size, only using it
does.

In the simple case, a Pod references existing ResourceClaims by name, which
will add some short strings to the PodSpec and to the ContainerSpec. Embedding
a ResourceClaimTemplate will increase the size more, but that will depend on
the number of custom parameters supported by a resource driver and thus is hard to
predict.

The ResourceClaim objects will initially be fairly small. However, if delayed
allocation is used, then the list of node names or NodeSelector instances
inside it might become rather large and in the worst case will scale with the
number of nodes in the cluster.
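
For illustration, the added fields can be inspected with `kubectl explain` on
a cluster where the feature gate and API group are enabled (the exact field
documentation varies by API version):

```sh
# New fields in the Pod spec and container resources.
kubectl explain pod.spec.resourceClaims
kubectl explain pod.spec.containers.resources.claims

# The ResourceClaim spec that a ResourceClaimTemplate embeds.
kubectl explain resourceclaim.spec
```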

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

<!--
Look at the [existing SLIs/SLOs].

Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.

[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->

Startup latency of schedulable stateless pods may be affected by enabling the
feature because some CPU cycles are needed for each Pod to determine whether it
uses ResourceClaims.

Actively using the feature will increase load on the apiserver, so latency of
API calls may get affected.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

<!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.

Think through this both in small and large cases, again with respect to the
[supported limits].

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->

Merely enabling the feature is not expected to increase resource usage much.

How much using it will increase resource usage depends on the usage patterns
and is hard to predict.
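
As an illustration only (requires metrics-server or another implementation of
the metrics API), resource usage of the affected components can be compared
before and after enabling the feature:

```sh
# CPU and memory of control plane pods.
kubectl top pods --namespace kube-system

# Per-node usage, which includes the kubelet and any DRA driver plugins.
kubectl top nodes
```
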
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

<!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?
Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->

### Troubleshooting

<!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->

###### How does this feature react if the API server and/or etcd is unavailable?

The Kubernetes control plane will be down, so no new Pods get
scheduled. kubelet may still be able to start or restart containers if it
already received all the relevant updates (Pod, ResourceClaim, etc.).

###### What are other known failure modes?

- DRA driver controller does not or cannot allocate a resource claim.

- Detection: The primary mechanism is through vendor-provided monitoring for
their driver. That monitoring needs to include health of the driver, availability
of the underlying resource, etc. The common helper code for DRA drivers
posts events for a ResourceClaim when an allocation attempt fails.

When pods fail to get scheduled, kube-scheduler reports that through events
and pod status. For DRA, that includes "waiting for resource driver to
provide information" (node not selected yet) and "waiting for resource
driver to allocate resource" (node has been selected). The
["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/9fca4ec44afad4775c877971036b436eef1a1759/pkg/scheduler/metrics/metrics.go#L200-L206)
metric will have pods counted under the "dynamicresources" plugin label.

To troubleshoot, "kubectl describe" can be used on (in this order) Pod,
ResourceClaim, PodSchedulingContext (see the example after this list).

@@ -1157,50 +1145,6 @@ already received all the relevant updates (Pod, ResourceClaim, etc.).
resources in one driver and then failing to allocate the remaining
resources in another driver (the "need to deallocate" fallback).

- A Pod gets scheduled without allocating resources.

- Detection: The Pod either fails to start (when kubelet has DRA
enabled) or gets started without the resources (when kubelet doesn't
have DRA enabled), which then will fail in an application-specific
way.

- Mitigations: DRA must get enabled properly in kubelet and kube-controller-manager.
Then kube-controller-manager will try to allocate and reserve resources for
already scheduled pods. To prevent this from happening for new pods, DRA
must get enabled in kube-scheduler.

- Diagnostics: kubelet will log pods without allocated resources as errors
and emit events for them.

- Testing: An E2E test covers the expected behavior of kubelet and
kube-controller-manager by creating a pod with `spec.nodeName` already set.

- A DRA driver kubelet plugin fails to prepare resources.

- Detection: The Pod fails to start after being scheduled.

- Mitigations: This depends on the specific DRA driver and has to be documented
by vendors.

- Diagnostics: kubelet will log pods with such errors and emit events for them.

- Testing: An E2E test covers the expected retry mechanism in kubelet when
`NodePrepareResources` fails intermittently.
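
The troubleshooting order mentioned above, written out as a hedged example;
names and namespace are placeholders, and the PodSchedulingContext type exists
only while the classic DRA API is enabled:

```sh
# Inspect the objects in this order to narrow down where allocation stalled.
kubectl describe pod my-pod --namespace my-namespace
kubectl describe resourceclaim my-claim --namespace my-namespace
kubectl describe podschedulingcontext my-pod --namespace my-namespace
```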


<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->

###### What steps should be taken if SLOs are not being met to determine the problem?

Performance depends to a large extent on how individual DRA drivers are

