DRA: PRR and API for beta of structured parameters in 1.32
Much of the PRR text that was originally written for "classic DRA" applies also
to "structured parameters". It gets moved from kubernetes#3063 to kubernetes#4381, with some minor
adaptations. The placeholder comments get restored in kubernetes#3063 because further work
on the KEP would be needed to move it forward - if it gets moved forward at all
instead of being abandoned.

The v1beta1 API will be almost identical to the v1alpha3 API, with just some
minor tweaks to fix oversights.

The kubelet gRPC API gets bumped with no changes. Nonetheless, drivers should get
updated, which can be done by updating the Go dependencies and optionally
changing the API import.
pohly committed Sep 24, 2024
1 parent a5ecee1 commit 3b937b9
Showing 3 changed files with 360 additions and 324 deletions.
260 changes: 102 additions & 158 deletions keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -952,33 +952,36 @@ when the feature is disabled.

### Rollout, Upgrade and Rollback Planning

<!--
This section must be completed when targeting beta to a release.
-->

###### How can a rollout fail? Can it impact already running workloads?

<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

Be sure to consider highly-available clusters, where, for example,
feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

Workloads not using ResourceClaims should not be impacted because the new code
will not do anything besides checking the Pod for ResourceClaims.

When kube-controller-manager fails to create ResourceClaims from
ResourceClaimTemplates, those Pods will not get scheduled. Bugs in
kube-scheduler might lead to not scheduling Pods that could run or, worse, to
scheduling Pods that should not run. Those then get stuck on a node where
kubelet will refuse to start them. None of these scenarios affect already
running workloads.

Failures in kubelet might affect running workloads, but only if containers for
those workloads need to be restarted.
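
As an illustrative sketch (not part of the KEP itself), these symptoms can be
checked with standard kubectl queries; `my-claim` and `my-namespace` are
placeholders, and the ResourceClaim API is only served while the feature is
enabled:

```sh
# Pods stuck in Pending, e.g. because a ResourceClaim could not be created or allocated.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Scheduling failures are reported as events on the affected Pods.
kubectl get events --all-namespaces --field-selector=reason=FailedScheduling

# Events posted for a specific ResourceClaim, e.g. failed allocation attempts.
kubectl describe resourceclaim my-claim --namespace my-namespace
```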

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

One indicator is unexpected restarts of the cluster control plane
components. Another is an increase in the number of pods that fail to
start. In both cases, further analysis of logs and pod events is needed to
determine whether errors are related to this feature.
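
For illustration only, assuming a kubeadm-style cluster where the control
plane runs as static pods labeled `tier=control-plane` (other distributions
may use different labels), these indicators can be spot-checked with:

```sh
# Unexpected restarts of control plane components show up in the RESTARTS column.
kubectl get pods --namespace kube-system --selector tier=control-plane

# Rough count of pods that are neither running nor completed.
kubectl get pods --all-namespaces --no-headers | grep -cvE 'Running|Completed|Succeeded'
```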

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

This will be done manually before transition to beta by bringing up a KinD
cluster with kubeadm and changing the feature gate for individual components.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -988,57 +991,32 @@ No.

###### How can an operator determine if the feature is in use by workloads?

There will be pods which have a non-empty PodSpec.ResourceClaims field, and
there will be ResourceClaim objects with `spec.controllerName` set.
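
A possible way to check this, assuming the cluster serves the ResourceClaim
API (exact field names depend on the API version in use):

```sh
# Pods together with the names of the ResourceClaims they reference;
# Pods that do not use the feature show <none>.
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLAIMS:.spec.resourceClaims[*].name'

# ResourceClaim objects created directly or from ResourceClaimTemplates.
kubectl get resourceclaims --all-namespaces
```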

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

For kube-controller-manager, metrics similar to the generic ephemeral volume
controller [were added](https://github.com/kubernetes/kubernetes/blob/163553bbe0a6746e7719380e187085cf5441dfde/pkg/controller/resourceclaim/metrics/metrics.go#L32-L47):

- [X] Metrics
  - Metric name: `resource_controller_create_total`
  - Metric name: `resource_controller_create_failures_total`
  - Metric name: `workqueue` with `name="resource_claim"`

For kube-scheduler and kubelet, existing metrics for handling Pods already
cover most aspects. For example, in the scheduler the
["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/6f5fa2eb2f4dc731243b00f7e781e95589b5621f/pkg/scheduler/metrics/metrics.go#L200-L206)
metric will call out pods that are currently unschedulable because of the
`DynamicResources` plugin.

For the communication between scheduler and controller, the apiserver metrics
about API calls (e.g. `request_total`, `request_duration_seconds`) for the
`podschedulingcontexts` and `resourceclaims` resources provide insights into
the amount of requests and how long they are taking.
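
For the apiserver part of these SLIs, a hedged example of how an operator
might pull the relevant series (kube-controller-manager and kube-scheduler
expose their own `/metrics` endpoints, not shown here):

```sh
# Request counts and latency histograms for the DRA resources, as recorded by
# the kube-apiserver; requires permission to read the /metrics endpoint.
kubectl get --raw /metrics \
  | grep -E 'apiserver_request_(total|duration_seconds).*resource="(resourceclaims|podschedulingcontexts)"'
```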

###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

<!--
This is your opportunity to define what "normal" quality of service looks like
for a feature.

It's impossible to provide comprehensive guidance, but at the very
high level (needs more precise definitions) those may be things like:
  - per-day percentage of API calls finishing with 5XX errors <= 1%
  - 99% percentile over day of absolute value from (job creation time minus expected
    job creation time) for cron job <= 10%
  - 99.9% of /health requests per day finish with 200 code

These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

For Pods not using ResourceClaims, the same SLOs apply as before.

For kube-controller-manager, metrics for the new controller could be checked to
ensure that work items do not remain in the queue for too long, for some
definition of "too long".

Pod scheduling and startup are more important. However, expected performance
will depend on how resources are used (for example, how often new Pods are
created), therefore it is impossible to predict what reasonable SLOs might be.

The resource manager component will do its work similarly to the
existing volume manager, but the overhead and complexity should
be lower:

* Resource preparation should be fairly quick, as in most cases it simply
  creates a CDI file 1-3 KiB in size. Unpreparing a resource usually means
  deleting the CDI file, so it should be quick as well.

* The complexity is lower than in the volume manager
because there is only one global operation needed (prepare vs.
attach + publish for each pod).

* Reconstruction after a kubelet restart is simpler (call
NodePrepareResource again vs. trying to determine whether
volumes are mounted).

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

No.
@@ -1055,93 +1033,103 @@ A third-party resource driver is required for allocating resources.

###### Will enabling / using this feature result in any new API calls?

<!--
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)

Focusing mostly on:
- components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes resources
  (e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state,
  heartbeats, leader election, etc.)
-->

For Pods not using ResourceClaims, not much changes. kube-controller-manager,
kube-scheduler and kubelet will have additional watches for ResourceClaim and
ResourceClass, but if the feature isn't used, those watches
will not cause much overhead.

If the feature is used, ResourceClaim will be modified during Pod scheduling,
startup and teardown by kube-scheduler, the third-party resource driver and
kubelet. Once a ResourceClaim is allocated and the Pod runs, there will be no
periodic API calls. How much this impacts performance of the apiserver
therefore mostly depends on how often this feature is used for new
ResourceClaims and Pods. Because it is meant for long-running applications, the
impact should not be too high.

###### Will enabling / using this feature result in introducing new API types?

<!--
Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->

For ResourceClass, only a few (something like 10 to 20)
objects per cluster are expected. Admins need to create those.

The number of ResourceClaim objects depends on how much the feature is
used. They are namespaced and get created directly or indirectly by users. In
the most extreme case, there will be one or more ResourceClaims for each Pod.
But that seems unlikely for the intended use cases.

Kubernetes itself will not impose specific limitations for the number of these
objects.
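
A sketch of how an operator could confirm which of these types are served and
how many objects exist, assuming the resource.k8s.io API group used by this
KEP is enabled:

```sh
# API types introduced for dynamic resource allocation.
kubectl api-resources --api-group=resource.k8s.io

# Number of ResourceClaim objects across all namespaces.
kubectl get resourceclaims --all-namespaces --no-headers | wc -l
```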

###### Will enabling / using this feature result in any new calls to the cloud provider?

Only if the third-party resource driver uses features of the cloud provider.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

<!--
Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->

The PodSpec potentially changes and thus all objects where it is embedded as
a template. Merely enabling the feature does not change the size, only using it
does.

In the simple case, a Pod references existing ResourceClaims by name, which
will add some short strings to the PodSpec and to the ContainerSpec. Embedding
a ResourceClaimTemplate will increase the size more, but that will depend on
the number of custom parameters supported by a resource driver and thus is hard to
predict.

The ResourceClaim objects will initially be fairly small. However, if delayed
allocation is used, then the list of node names or NodeSelector instances
inside it might become rather large and in the worst case will scale with the
number of nodes in the cluster.
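
For illustration, the added fields can be inspected with `kubectl explain` on
a cluster where the feature gate and API group are enabled (the exact field
documentation varies by API version):

```sh
# New fields in the Pod spec and container resources.
kubectl explain pod.spec.resourceClaims
kubectl explain pod.spec.containers.resources.claims

# The ResourceClaim spec that a ResourceClaimTemplate embeds.
kubectl explain resourceclaim.spec
```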

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

<!--
Look at the [existing SLIs/SLOs].

Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.

[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->

Startup latency of schedulable stateless pods may be affected by enabling the
feature because some CPU cycles are needed for each Pod to determine whether it
uses ResourceClaims.

Actively using the feature will increase load on the apiserver, so latency of
API calls may get affected.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

<!--
Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased log
volume), significant amount of data sent and/or received over network, etc.

Think through this both in small and large cases, again with respect to the
[supported limits].

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->

Merely enabling the feature is not expected to increase resource usage much.

How much using it will increase resource usage depends on the usage patterns
and is hard to predict.
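
As an illustration only (requires metrics-server or another implementation of
the metrics API), resource usage of the affected components can be compared
before and after enabling the feature:

```sh
# CPU and memory of control plane pods.
kubectl top pods --namespace kube-system

# Per-node usage, which includes the kubelet and any DRA driver plugins.
kubectl top nodes
```
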
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

<!--
Focus not just on happy cases, but primarily on more pathological cases
(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
If any of the resources can be exhausted, how this is mitigated with the existing limits
(e.g. pods per node) or new limits added by this KEP?
Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->

### Troubleshooting

<!--
This section must be completed when targeting beta to a release.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.
-->

###### How does this feature react if the API server and/or etcd is unavailable?

The Kubernetes control plane will be down, so no new Pods get
scheduled. kubelet may still be able to start or restart containers if it
already received all the relevant updates (Pod, ResourceClaim, etc.).

###### What are other known failure modes?

- DRA driver controller does not or cannot allocate a resource claim.

- Detection: The primary mechanism is through vendor-provided monitoring for
their driver. That monitoring needs to include health of the driver, availability
of the underlying resource, etc. The common helper code for DRA drivers
posts events for a ResourceClaim when an allocation attempt fails.

When pods fail to get scheduled, kube-scheduler reports that through events
and pod status. For DRA, that includes "waiting for resource driver to
provide information" (node not selected yet) and "waiting for resource
driver to allocate resource" (node has been selected). The
["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/9fca4ec44afad4775c877971036b436eef1a1759/pkg/scheduler/metrics/metrics.go#L200-L206)
metric will have pods counted under the "dynamicresources" plugin label.

To troubleshoot, "kubectl describe" can be used on (in this order) Pod,
ResourceClaim, PodSchedulingContext (see the example after this list).

@@ -1157,50 +1145,6 @@ already received all the relevant updates (Pod, ResourceClaim, etc.).
resources in one driver and then failing to allocate the remaining
resources in another driver (the "need to deallocate" fallback).

- A Pod gets scheduled without allocating resources.

- Detection: The Pod either fails to start (when kubelet has DRA
enabled) or gets started without the resources (when kubelet doesn't
have DRA enabled), which then will fail in an application-specific
way.

- Mitigations: DRA must get enabled properly in kubelet and kube-controller-manager.
Then kube-controller-manager will try to allocate and reserve resources for
already scheduled pods. To prevent this from happening for new pods, DRA
must get enabled in kube-scheduler.

- Diagnostics: kubelet will log pods without allocated resources as errors
and emit events for them.

- Testing: An E2E test covers the expected behavior of kubelet and
kube-controller-manager by creating a pod with `spec.nodeName` already set.

- A DRA driver kubelet plugin fails to prepare resources.

- Detection: The Pod fails to start after being scheduled.

- Mitigations: This depends on the specific DRA driver and has to be documented
by vendors.

- Diagnostics: kubelet will log pods with such errors and emit events for them.

- Testing: An E2E test covers the expected retry mechanism in kubelet when
`NodePrepareResources` fails intermittently.
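
The troubleshooting order mentioned above, written out as a hedged example;
names and namespace are placeholders, and the PodSchedulingContext type exists
only while the classic DRA API is enabled:

```sh
# Inspect the objects in this order to narrow down where allocation stalled.
kubectl describe pod my-pod --namespace my-namespace
kubectl describe resourceclaim my-claim --namespace my-namespace
kubectl describe podschedulingcontext my-pod --namespace my-namespace
```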


<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->

###### What steps should be taken if SLOs are not being met to determine the problem?

Performance depends to a large extent on how individual DRA drivers are

