diff --git a/keps/prod-readiness/sig-node/4112.yaml b/keps/prod-readiness/sig-node/4112.yaml new file mode 100644 index 00000000000..37b677a15cd --- /dev/null +++ b/keps/prod-readiness/sig-node/4112.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 4112 +alpha: + approver: "@johnbelamaric" diff --git a/keps/sig-node/4112-passdown-resources-to-cri/README.md b/keps/sig-node/4112-passdown-resources-to-cri/README.md new file mode 100644 index 00000000000..502cc991d64 --- /dev/null +++ b/keps/sig-node/4112-passdown-resources-to-cri/README.md @@ -0,0 +1,1297 @@ + +# [KEP-4112](https://github.com/kubernetes/enhancements/issues/4112): Pass down resources to CRI + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Story 3](#story-3) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [CRI API](#cri-api) + - [PodSandboxConfig](#podsandboxconfig) + - [CreateContainer](#createcontainer) + - [UpdateContainerResourcesRequest](#updatecontainerresourcesrequest) + - [UpdatePodSandboxResources](#updatepodsandboxresources) + - [kubelet](#kubelet) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review 
Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Container annotations](#container-annotations) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting 
documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +The CRI runtime lacks visibility to the application resource requirements. + +First, the resources required by the containers of a pod are not visible at the +pod sandbox creation time. This can be problematic for example in the case of +VM-based runtimes where all resources need to be reserved/prepared when the VM +(i.e. sandbox) is being created. + +Second, the kubelet does not provide complete information about the container +resources specification of native and extended resources (requests and limits) +to CRI. However, various use cases have been identified where detailed +knowledge of all the resources can be utilized in container runtimes for more +optimal resource allocation to improve application performance and reduce +cross-application interference. + +This KEP proposes CRI API extensions for providing complete view of pods +resources at sandbox creation, and, providing unobfuscated information about +the resource requests and limits to container runtimes. + +## Motivation + +When the pod sandbox is created, the kubelet does not provide the CRI runtime +any information about the resources (such as native resources, host devices, +mounts, CDI devices etc) that will be required by the application. The CRI +runtime only becomes aware of the resources piece by piece when containers of +the pod are created (one-by-one). + +This can cause issues with VM-based runtimes +(e.g. [Kata containers](https://katacontainers.io/) and [Confidential Containers](https://www.cncf.io/projects/confidential-containers/)) that need to prepare the VM before containers are created. 
+ +For Kata to handle PCIe devices properly the CRI needs to tell the kata-runtime +how many PCIe root-ports or PCIe switch-ports the hypervisor needs to create at +sandbox creation depending on the number of devices allocated by the containers. +The PCIe root-port is a static configuration and the hypervisor cannot adjust it +once the sandbox is created. During container creation the PCIe devices are +hot-plugged to the PCIe root-port or switch-port. If the number of pre-allocated +pluggable ports is too low, the attachment will fail (container devices > +pre-allocated hot-pluggable ports). + +In the case of Confidential Containers (uses Kata under the hood with additional +software components for attestation) the CRI needs to consider the cold-plug aka +direct attachment use-case. At sandbox creation time the hypervisor needs to +know the exact number of pass-through devices and their properties +(VFIO IOMMU group, the actual VFIO device - there can be several devices in an +IOMMU group, attach to PCIe root-port or PCIe switch-port (PCI-Bridge)). +In a confidential setting a user does not want to reconfigure the VM +(which creates an attack vector) on every create container request. The hypervisor +needs a fully static view of resources needed for VM sizing. + +Independent of hot- or cold-plug, the hypervisor needs to know what the PCI(e) +topology needs to look like at sandbox creation time. + +Updating resources of a container also means resizing the VM, hence the +hypervisor needs the complete list of resources available at an update container +request. + +Another visibility issue is related to the native and extended resources. +Kubelet manages the native resources (CPU and memory) and communicates resource +parameters over the CRI API to the runtime. The following snippet shows the +currently supported CRI annotations that are provided by the Kubelet to e.g. 
+`containerd`: + +```sh +pkg/cri/annotations/annotations.go + + // SandboxCPU annotations are based on the initial CPU configuration for the sandbox. This is calculated as the + // sum of container CPU resources, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig + SandboxCPUPeriod = "io.kubernetes.cri.sandbox-cpu-period" + SandboxCPUQuota = "io.kubernetes.cri.sandbox-cpu-quota" + SandboxCPUShares = "io.kubernetes.cri.sandbox-cpu-shares" + + // SandboxMemory is the initial amount of memory associated with this sandbox. This is calculated as the sum + // of container memory, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig. + SandboxMem = "io.kubernetes.cri.sandbox-memory" +``` + +However, the original details of +the resource spec are lost as they get translated (within kubelet) to +platform-specific (i.e. Linux or Windows) resource controller parameters like +cpu shares, memory limits etc. Non-native resources such as extended resources +and the device plugin resources are completely invisible to the CRI runtime. However, +[OCI hooks](https://github.com/opencontainers/runtime-spec/blob/master/config.md), +[runC](https://github.com/opencontainers/runc) wrappers, +[NRI](https://github.com/containerd/nri) plugins or in some cases even +applications themselves would benefit from seeing the original resource requests +and limits e.g. for doing customized resource optimization. + +Extending the CRI API to communicate all resources already at sandbox creation +and pass down resource requests and limits (of native and extended resources) +would provide a comprehensive and early-enough view of the resource usage of +all containers of the pod, allowing improved resource allocation without +breaking any existing use cases. + +### Goals + +- make the information about all required resources (e.g. 
native and extended + resources, devices, mounts, CDI devices) of a Pod available to the CRI at + sandbox creation time +- make container resource spec transparently visible to CRI (the container + runtime) + +### Non-Goals + +- change kubelet resource management +- change existing behavior of CRI + +## Proposal + +### User Stories + +#### Story 1 + +As a VM-based container runtime developer, I want to allocate/expose enough +RAM, hugepages, hot- or cold-pluggable PCI(e) ports, protected memory sections +and other resources for the VM to ensure that all containers in the pod are +guaranteed to get the resources they require. + +#### Story 2 + +As a developer of non-runc / non-Linux CRI runtime, I want to know detailed +container resource requests to be able to make correct resource allocation for +the applications. I cannot rely on cgroup parameters on this but need to know +what the user requested to fairly allocate resources between applications. + +#### Story 3 + +As a cluster administrator, I want to install an NRI plugin that does +customized resource handling. I run kubelet with CPU manager and memory manager +disabled (CPU manager policy set to `none`). Instead I use my NRI plugin to do +customized resource allocation (e.g. cpu and memory pinning). To do that +properly I need the actual resource requests and limits requested by the user. + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + + + +The proposal only adds new informational data to the CRI API between kubelet +and the container runtime with no user-visible changes which mitigates possible +risks considerably. + +Data duplication/inconsistency with native resources could be considered a risk +as those are passed down to CRI both as "raw" requests and limits and as +"translated" resource control parameters (like cpu shares oom scoring etc). But +this should be largely mitigated by code reviews and unit tests. 
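
The duplication mentioned above can be illustrated with a small consistency check between a "raw" CPU request and its "translated" cpu shares value. This is only a minimal sketch: the function names are hypothetical, and the constants and conversion formula are an assumption based on the conversion the kubelet is understood to use for cgroup v1 cpu shares (1024 shares per CPU, clamped to a minimum of 2 and a maximum of 262144). It is not part of the proposed CRI API.

```go
package main

import "fmt"

// Assumed constants for the cgroup v1 cpu shares conversion.
const (
	sharesPerCPU  = 1024
	milliCPUToCPU = 1000
	minShares     = 2
	maxShares     = 262144
)

// milliCPUToShares converts a raw CPU request (in millicores) to cpu shares.
// Hypothetical helper; the formula is an assumption about kubelet behavior.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		// No request maps to the minimum shares value.
		return minShares
	}
	shares := milliCPU * sharesPerCPU / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	if shares > maxShares {
		return maxShares
	}
	return shares
}

// consistent reports whether the translated cpu shares agree with the raw
// CPU request that a runtime would receive alongside them.
func consistent(rawRequestMilliCPU, translatedShares int64) bool {
	return milliCPUToShares(rawRequestMilliCPU) == translatedShares
}

func main() {
	fmt.Println(milliCPUToShares(1000)) // raw request "cpu: 1"
	fmt.Println(consistent(1000, 1024)) // raw and translated values agree
	fmt.Println(consistent(1000, 512))  // raw and translated values disagree
}
```

A unit test in this spirit could assert that both views of the same request stay in agreement whenever the kubelet populates them.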
+ +## Design Details + +The proposal is that kubelet discloses full resource information from the +PodSpec to the container runtime. This is accomplished by extending the +ContainerConfig, UpdateContainerResourcesRequest and PodSandboxConfig messages +of the CRI API. + +With this information, the runtime can for example do detailed resource +allocation so that CPU, memory and other resources for each container are +optimally aligned. This applies to scenarios where the kubelet CPU manager is +disabled (by using the `none` CPU manager policy). + +The resource information is included in PodSandboxConfig so that the runtime +can see the full picture of the Pod's resource usage at Pod creation time, for +example enabling more holistic resource allocation and thus better +interoperability between containers inside the Pod. + +Also the CreateContainer request is extended to include the unmodified resource +requirements. This makes it possible for the CRI runtime to detect any changes +in the pod resources that happen between the Pod creation and container +creation in e.g. scenarios where in-place pod updates are involved. + +[KEP-1287][kep-1287] ([Issue][kep-1287-issue]) Beta in Kubernetes v1.32 +introduced the UpdatePodSandboxResources rpc to the CRI API. The +UpdatePodSandboxResources CRI message is updated to include the resource +information of all containers (aligning with UpdateContainerResourcesRequest). + +[KEP-2837][kep-2837] ([Issue][kep-2837-issue]) Alpha in Kubernetes v1.32 +introduced a Pod-level resource requirements field to the PodSpec. The +PodResourceConfig message in the CRI API is updated to include the Pod-level +resource requirements. + +### CRI API + +#### PodSandboxConfig + +The PodSandboxConfig message (part of the RunPodSandbox request) will be +extended to contain information about resources of all its containers known at +the pod creation time. The container runtime may use this information to make +preparations for all upcoming containers of the pod. 
E.g. set up all needed +resources for a VM-based pod or prepare for optimal allocation of resources of +all the containers of the Pod. However, the container runtime may continue to +operate as it did (before this enhancement). That is, it can ignore +the resource information presented here and allocate resources for each +container separately at container creation time with the `CreateContainer` +request. + +The Pod-level resources enhancement [KEP-2837][kep-2837] +([Issue][kep-2837-issue]) Alpha in Kubernetes v1.32 added a new Pod-level +resource requirements field to the PodSpec. This information is included +in the PodResourceConfig message, similar to the container-level resource +information. + +```diff + message PodSandboxConfig { + + ... + + // Optional configurations specific to Linux hosts. + LinuxPodSandboxConfig linux = 8; + // Optional configurations specific to Windows hosts. + WindowsPodSandboxConfig windows = 9; ++ ++ // Kubernetes resource spec of the containers in the pod. ++ PodResourceConfig pod_resources = 10; + } + ++// PodResourceConfig contains information of all resource requirements of ++// the containers of a pod. ++message PodResourceConfig { ++ // Resource configuration of all containers in the pod. ++ repeated ContainerResourceConfig containers = 1; ++ ++ // Kubernetes resource spec of the pod-level resource requirements. ++ // This is the pod-level resource requirements introduced in KEP-2837 ++ // (alpha in v1.32). ++ KubernetesResources kubernetes_resources = 2; ++} + ++// ContainerResourceConfig contains information of all resource requirements of ++// one container. ++message ContainerResourceConfig { ++ // Name of the container ++ string name = 1; ++ ++ // Type of the container ++ ContainerType type = 2; ++ ++ // Kubernetes resource spec of the container ++ KubernetesResources kubernetes_resources = 3; ++ ++ // Mounts for the container. ++ repeated Mount mounts = 4; ++ ++ // Devices for the container. 
++ repeated Device devices = 5; ++ ++ // CDI devices for the container. ++ repeated CDIDevice CDI_devices = 6; ++} + ++enum ContainerType { ++ INIT_CONTAINER = 0; ++ SIDECAR_CONTAINER = 1; ++ CONTAINER = 2; ++} +``` + +#### CreateContainer + +The ContainerConfig message (used in the CreateContainer request) is extended to +contain unmodified resource requests from the PodSpec. + +```diff ++import "k8s.io/apimachinery/pkg/api/resource/generated.proto"; + + message ContainerConfig { + + ... + + // Configuration specific to Windows containers. + WindowsContainerConfig windows = 16; + + // CDI devices for the container. + repeated CDIDevice CDI_devices = 17; ++ ++ // Kubernetes resource spec of the container ++ KubernetesResources kubernetes_resources = 18; + } + ++// KubernetesResources contains the resource requests and limits as specified ++// in the Kubernetes core API ResourceRequirements. ++message KubernetesResources { ++ // Requests and limits from the Kubernetes container config. ++ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 1; ++ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 2; ++} +``` + +Note that mounts, devices, CDI devices are part of the ContainerConfig message +but are left out of the diff snippet above. + +Including the KubernetesResources in the ContainerConfig message serves +multiple purposes: + +1. Catch changes that happen between pod sandbox creation and container + creation. For example, in-place pod updates might change the container + before it is created. +2. Catch changes that happen over container restarts in in-place pod update + scenarios +3. Consistency/completeness. Have enough information to take consistent action + based only on information present in this rpc call. + +The resources (mounts, devices, CDI devices, Kubernetes resources) in the +CreateContainer request should be identical to what was (pre-)informed in the +RunPodSandbox request. 
If they are different, the CRI runtime may fail the +container creation, for example because changes cannot be applied after a +VM-based Pod has been created. + +#### UpdateContainerResourcesRequest + +The UpdateContainerResourcesRequest message is extended to pass down unmodified +resource requests from the PodSpec. + +```diff + message UpdateContainerResourcesRequest { + // ID of the container to update. + string container_id = 1; + // Resource configuration specific to Linux containers. + LinuxContainerResources linux = 2; + // Resource configuration specific to Windows containers. + WindowsContainerResources windows = 3; + // Unstructured key-value map holding arbitrary additional information for + // container resources updating. This can be used for specifying experimental + // resources to update or other options to use when updating the container. + map<string, string> annotations = 4; ++ ++ // Kubernetes resource spec of the container ++ KubernetesResources kubernetes_resources = 5; + } +``` + +Note that mounts, devices, CDI devices are not part of the +UpdateContainerResourcesRequest message and this proposal does not suggest +adding them. + +#### UpdatePodSandboxResources + +The In-Place Update of Pod Resources ([KEP-1287][kep-1287]) Beta in Kubernetes +v1.32 introduced the new UpdatePodSandboxResources rpc to inform the CRI runtime +about the changes in the pod resources. + +The UpdatePodSandboxResourcesRequest message is extended similarly to the +[PodSandboxConfig](#podsandboxconfig) message to contain information about +resources of all its containers. In UpdatePodSandboxResourcesRequest this will +reflect the updated resource requirements of the containers. + +```diff + message UpdatePodSandboxResourcesRequest { + // ID of the PodSandbox to update. 
+ string pod_sandbox_id = 1; + + // Optional overhead represents the overheads associated with this sandbox + LinuxContainerResources overhead = 2; + // Optional resources represents the sum of container resources for this sandbox + LinuxContainerResources resources = 3; + + // Unstructured key-value map holding arbitrary additional information for + // sandbox resources updating. This can be used for specifying experimental + // resources to update or other options to use when updating the sandbox. + map<string, string> annotations = 4; ++ ++ // Kubernetes resource spec of the containers in the pod. ++ PodResourceConfig pod_resources = 5; + } +``` + +The implementation will be synced with [KEP-1287][kep-1287]. + +### kubelet + +Kubelet code is refactored/modified so that all container resources are known +before sandbox creation. This mainly consists of preparing all mounts (of all +containers) early. + +Kubelet will be extended to pass down all resources of containers in all +related CRI requests (as described in the [CRI API](#cri-api) section). That +is: + +- adding mounts, devices, CDI devices and the unmodified resource requests and + limits of all containers into RunPodSandbox request +- adding unmodified resource requests and limits into CreateContainer and + UpdateContainerResources requests + +For example, take a PodSpec: + +```yaml +apiVersion: v1 +kind: Pod +... +spec: + containers: + - name: cnt-1 + image: k8s.gcr.io/pause + resources: + requests: + cpu: 1 + memory: 1G + example.com/resource: 1 + limits: + cpu: 2 + memory: 2G + example.com/resource: 1 + volumeMounts: + - mountPath: /my-volume + name: my-volume + - mountPath: /image-volume + name: image-volume + volumes: + - name: my-volume + emptyDir: + - name: image-volume + image: + reference: example.com/registry/artifact:tag +``` + +Then kubelet will send the following RunPodSandboxRequest when creating the Pod +(represented here in yaml format): + +```yaml +RunPodSandboxRequest: + config: + ... 
+ podResources: + containers: + - name: cnt-1 + kubernetes_resources: + requests: + cpu: "1" + memory: 1G + example.com/resource: "1" + limits: + cpu: "2" + memory: 2G + example.com/resource: "1" + CDI_devices: + - name: example.com/resource=CDI-Dev-1 + mounts: + - container_path: /my-volume + host_path: /var/lib/kubelet/pods//volumes/kubernetes.io~empty-dir/my-volume + - container_path: /image-volume + image: + image: example.com/registry/artifact:tag + ... + - container_path: /var/run/secrets/kubernetes.io/serviceaccount + host_path: /var/lib/kubelet/pods//volumes/kubernetes.io~projected/kube-api-access-4srqm + readonly: true + - container_path: /dev/termination-log + host_path: /var/lib/kubelet/pods//containers/cnt-1/ +``` + +Note that all device plugin resources are passed down in the +`kubernetes_resources` field but this does not contain any properties of the +device that was actually allocated for the container. However, these properties +are exposed through the `CDI_devices`, `mounts` and `devices` fields. + +### Test Plan + + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +No prerequisite testing updates have been identified. + +##### Unit tests + + + + + +- `k8s.io/kubernetes/pkg/kubelet/kuberuntime`: `2024-02-02` - `68.3%` + +The +[fake_runtime](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/testing/fake_runtime_service.go) +will be used in unit tests to verify that the Kubelet correctly passes down the +resource information to the CRI runtime. + +##### Integration tests + + + +For alpha, no new integration tests are planned. + +##### e2e tests + + + +For alpha, no new e2e tests are planned. 
+ +For Beta: a suite of NRI tests will be added to verify that the runtime +receives the resource information correctly and passes it down to the NRI +plugins. + +### Graduation Criteria + + + +#### Alpha + +- Feature implemented behind a feature flag +- Initial unit tests completed and enabled + +#### Beta + +- Gather feedback from developers and surveys +- Feature gate enabled by default +- containerd and CRI-O runtimes have released versions that have adopted the + new CRI API changes + +#### GA + +- No bugs reported in the previous cycle + +### Upgrade / Downgrade Strategy + + + +The feature gate (in kubelet) controls the feature enablement. Existing runtime +implementations will continue to work as previously, even if the feature is +enabled. + +### Version Skew Strategy + + + +The feature is node-local (kubelet-only) so there are no dependencies on or effects +on other Kubernetes components. + +The behavior is unchanged if either kubelet or the CRI runtime running on a +node does not support the feature. If kubelet has the feature enabled but the +CRI runtime does not support it, the CRI runtime will ignore the new fields in +the CRI API and function as previously. Similarly, if the CRI runtime supports +the feature but the kubelet does not, the runtime will resort to the previous +behavior. + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [X] Feature gate + - Feature gate name: KubeletContainerResourcesInPodSandbox + - Components depending on the feature gate: + - kubelet + +###### Does enabling the feature change any default behavior? + + + +Yes. The kubelet will start passing the extra information to the CRI runtime +for every container it creates. Whether this has any effect depends on whether the +underlying CRI runtime supports this feature. 
For example, an NRI plugin +relying on the feature may cause the application to behave differently. + +Long-running pods that persist (without restart) across a kubelet and CRI runtime +update that enables the feature may experience version skew of the metadata. +After enabling the feature, the CRI runtime does not have the aggregated +information of all resources of the pod, provided with this feature, as the +kubelet didn't restart these pods (didn't send the RunPodSandbox CRI +request). This may affect some scenarios e.g. NRI plugins. This "metadata skew" +can be avoided by draining the node before updating the kubelet and the CRI +runtime. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +Yes, disabling the `KubeletContainerResourcesInPodSandbox` feature gate will +disable the feature. Restarting pods may be needed to reset the information +that was passed down to the CRI. + +###### What happens if we reenable the feature if it was previously rolled back? + +New pods will have the feature enabled. Existing pods will continue to operate +as before until restarted. + +###### Are there any tests for feature enablement/disablement? + + + +Unit tests for the feature gate will be added. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +Rollback or rollout in the kubelet should not fail - it only enables/disables +the information (fields in the CRI message) passed down to the CRI runtime. + +However, if the CRI runtime depends on the feature, a rollout or rollback may +cause failures of applications on pod restarts. Running pods are not affected. + +###### What specific metrics should inform a rollback? + + + +Alpha: No new metrics are planned. An increase in the existing +`kubelet_started_pods_errors_total` metric can indicate a problem caused by +this feature. 
+ +Generally, non-ready pods with CreatePodSandboxError status (reflected by the +`kubelet_started_pods_errors_total` metric) are a possible indicator. The error +message will contain details if the CRI failure is related to the feature. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +Alpha: Manual testing of the feature gate is performed. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +No. + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +By examining the kubelet feature gate and the version of the CRI runtime. The +enablement of the kubelet feature gate can be determined from the +`kubernetes_feature_enabled` metric. + +###### How can someone using this feature know that it is working for their instance? + + + +The end users do not see the status of the feature directly. + +The cluster operator can verify that the feature is working by examining the +kubelet and CRI runtime logs. + +The CRI runtime or NRI plugin developers depending on the feature can ensure +that it is working by verifying that all the required information is available +at pod sandbox creation time. + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +No increase in the `kubelet_started_pods_errors_total` rate. + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +- [x] Metrics + - Metric name: `kubelet_started_pods_errors_total` + - Components exposing the metric: kubelet + +> NOTE: The `kubelet_started_pods_errors_total` metric is a general metric for +> any errors that occur when starting pods. The error message (Pod events, +> kubelet logs) will contain details if the CRI failure is related to the +> feature. 
+ +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +N/A. + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +No. + +However, the practical usability of this feature requires that also the CRI +runtime supports it. The feature is effectively a no-op if the CRI runtime does +not support it. + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +No. + +###### Will enabling / using this feature result in introducing new API types? + + + +No. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +Not noticeably. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +No. The new data fields in the CRI API would not count as significant increase. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +No. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +N/A. The feature is node-local. + +###### What are other known failure modes? + + + +The feature in Kubernetes is relatively straightforward - passing extra +information to the CRI runtime. 
The failure scenarios arise in the CRI runtime +level, e.g.: + +- misbehaving CRI runtime or NRI plugin +- CRI runtime or NRI plugin is depending on the feature but it is not enabled + in the kubelet +- configuration skew in the cluster where some nodes have the feature enabled + and some do not + +Pod events and CRI runtime logs are the primary sources of information for +these failure scenarios. + +###### What steps should be taken if SLOs are not being met to determine the problem? + +N/A. + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +### Container annotations + +Container annotations could be used as an alternative way to pass down the +resource requests and limits to the container runtime. + +## Infrastructure Needed (Optional) + + + + + +[kep-1287]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources +[kep-1287-issue]: https://github.com/kubernetes/enhancements/issues/1287 +[kep-2837]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2837-pod-level-resource-spec +[kep-2837-issue]: https://github.com/kubernetes/enhancements/issues/2837 diff --git a/keps/sig-node/4112-passdown-resources-to-cri/kep.yaml b/keps/sig-node/4112-passdown-resources-to-cri/kep.yaml new file mode 100644 index 00000000000..8f04f0eb00d --- /dev/null +++ b/keps/sig-node/4112-passdown-resources-to-cri/kep.yaml @@ -0,0 +1,39 @@ +title: Pass down resources to CRI +kep-number: 4112 +authors: + - "@marquiz" + - "@askervin" +owning-sig: sig-node +participating-sigs: [] +status: implementable +creation-date: 2023-06-28 +reviewers: + - "@mikebrow" +approvers: + - "@haircommander" + +see-also: [] +replaces: [] + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. 
+latest-milestone: "v1.33" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.33" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: KubeletContainerResourcesInPodSandbox + components: + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: []