diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md
index d1e090027cf..0696b4d0e1a 100644
--- a/keps/sig-node/2400-node-swap/README.md
+++ b/keps/sig-node/2400-node-swap/README.md
@@ -9,7 +9,13 @@
 - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
 - [Enable Swap Support only for Burstable QoS Pods](#enable-swap-support-only-for-burstable-qos-pods)
- - [Set Aside Swap for System Critical Daemon](#set-aside-swap-for-system-critical-daemon)
+ - [Set Aside Swap for System Critical Daemons](#set-aside-swap-for-system-critical-daemons)
+ - [Best Practices](#best-practices)
+ - [Disable swap for system critical daemons](#disable-swap-for-system-critical-daemons)
+ - [Protect system critical daemons for iolatency](#protect-system-critical-daemons-for-iolatency)
+ - [Control Plane Swap](#control-plane-swap)
+ - [Use of a dedicated disk for swap](#use-of-a-dedicated-disk-for-swap)
+ - [Swap as the default](#swap-as-the-default)
 - [Steps to Calculate Swap Limit](#steps-to-calculate-swap-limit)
 - [Example](#example)
 - [User Stories](#user-stories)
@@ -21,12 +27,16 @@
 - [Virtualization management overhead](#virtualization-management-overhead)
 - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
 - [Risks and Mitigations](#risks-and-mitigations)
+ - [Existing use cases of Swap](#existing-use-cases-of-swap)
+ - [Exhausting swap resource](#exhausting-swap-resource)
 - [Security risk](#security-risk)
+ - [Cgroupv1 support](#cgroupv1-support)
 - [Design Details](#design-details)
 - [Enabling swap as an end user](#enabling-swap-as-an-end-user)
 - [API Changes](#api-changes)
 - [KubeConfig addition](#kubeconfig-addition)
 - [CRI Changes](#cri-changes)
+ - [Swap Metrics](#swap-metrics)
 - [Test Plan](#test-plan)
 - [Prerequisite testing updates](#prerequisite-testing-updates)
 - [Unit tests](#unit-tests)
@@ -73,16 +83,16 @@ checklist items _must_ be updated for the enhancement to be released.
 Items marked with (R) are required *prior to targeting to a milestone / release*.
 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
-- [ ] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] (R) Graduation criteria is in place
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) KEP approvers have approved the KEP status as `implementable`
+- [x] (R) Design details are appropriately documented
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] (R) Graduation criteria is in place
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
-No new tests added for Alpha and Alpha2 releases.
+NA.
+
+These tasks require e2e test setup so we did not add any integration tests for this.
 ##### e2e tests
@@ -544,8 +649,8 @@ For alpha:
 Test grid tabs enabled:
 - [kubelet-gce-e2e-swap-ubuntu](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu): Green
 - [kubelet-gce-e2e-swap-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu-serial): Green
-- [kubelet-gce-e2e-swap-fedora](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora): Degraded
-- [kubelet-gce-e2e-swap-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora-serial): Degraded
+- [kubelet-gce-e2e-swap-fedora](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora): Green
+- [kubelet-gce-e2e-swap-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora-serial): Green
 
 No new e2e tests introduced.
 
@@ -566,7 +671,7 @@ For beta 1:
 #### Alpha
 
 - Kubelet can be started with swap enabled and will support two configurations
-  for Kubernetes workloads: `LimitedSwap` and `UnlimitedSwap`.
+  for Kubernetes workloads: `LimitedSwap` and `NoSwap`.
 - Kubelet can configure CRI to allocate swap to Kubernetes workloads. By
   default, workloads will not be allocated any swap.
 - e2e test jobs are configured for Linux systems with swap enabled.
@@ -599,7 +704,8 @@ Here are specific improvements to be made:
 - Investigate eviction behavior with swap enabled.
 
 #### Beta 1
-- Enable Swap Support using Burstable QoS Pods only.
+
+- Enable Swap Support using Burstable QoS Pods only.
 - Enable Swap Support for Cgroup v2 Only.
 - Add swap memory to the Kubelet stats api.
 - Determine a set of metrics for node QoS in order to evaluate the performance
@@ -608,12 +714,18 @@ Here are specific improvements to be made:
 - Improve coverage for appropriate scenarios in testgrid.
 
 #### Beta 2
-- Publish a Kubernetes doc page encoring user to use encrypted swap if they wish to enable this feature.
+
+- Publish a Kubernetes doc page encouraging users to use encrypted swap if they wish to enable this feature.
 - Add [swap specific tests](https://github.com/kubernetes/kubernetes/issues/120798) such as, handling the usage of swap during container restart boundaries for writes to tmpfs (which may require pod cgroup change beyond what container runtime will do at (container cgroup boundary).
 - Fix flaking/failing swap node e2e jobs.
 - Address eviction related [issue](https://github.com/kubernetes/kubernetes/issues/120800) in swap implementation.
+- Add `NoSwap` as the default setting.
+- Remove `UnlimitedSwap` as a supported option.
+- Add e2e test confirming that `NoSwap` will actually not swap
+- Add e2e test confirming that swap is used for `LimitedSwap`.
+- Document [best practices](#best-practices) for setting up Kubernetes with swap
 
 [via cgroups]: #restrict-swap-usage-at-the-cgroup-level
 
@@ -623,8 +735,7 @@ _(Tentative.)_
 
 - Test a wide variety of scenarios that may be affected by swap support.
 - Remove feature flag.
-- Remove the Swap Support using Burstable QoS Pods only deprecated in Beta 2.
-
+- Remove the Swap Support using Burstable QoS Pods only deprecated in Beta 2.
 
 ### Upgrade / Downgrade Strategy
 
@@ -735,15 +846,22 @@ feature, can it break the existing applications?).
 NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
 -->
 
-No. The feature flag can be disabled while the `--fail-swap-on=false` flag is
-set, but this would result in undefined behaviour.
-
 To turn this off, the kubelet would need to be restarted. If a cluster admin
 wants to disable swap on the node without repartitioning the node, they could
 stop the kubelet, set `swapoff` on the node, and restart the kubelet with
 `--fail-swap-on=true`. The setting of the feature flag will be ignored in this
 case.
 
+In Beta2, we realize that we cannot rely on `--fail-swap-on=false`
+as a flag for this feature. The flag predates this feature and it has
+been used over time. We propose a configuration in `MemorySwap` called `NoSwap`.
+Users could also set `NoSwap` in `MemorySwap` to disable all workloads from
+using swap without requiring the user to disable swap if that is needed.
+
+In Beta releases of this feature, one could use turn off `NodeSwap` feature toggle
+but once this feature is GA, users could use another option to disable swap
+for workloads.
+
 ###### What happens if we reenable the feature if it was previously rolled back?
 
 N/A
@@ -824,6 +942,8 @@ This section must be completed when targeting beta to a release.
 
 ###### How can someone using this feature know that it is working for their instance?
 
+See #swap-metrics
+
 1. Kubelet stats API will be extended to show swap usage details.
 
 ###### How can an operator determine if the feature is in use by workloads?
@@ -882,7 +1002,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
 implementation difficulties, etc.).
 -->
 
-N/A
+We added metrics to the node stats to report how much swap is used
+and the capacity of swap.
 
 ### Dependencies
 
@@ -986,9 +1107,14 @@ Think about adding additional work or introducing new steps in between
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 -->
 
-It is possible for this feature to affect performance of some worker node-level
-SLIs/SLOs. We will need to monitor for differences, particularly during beta
-testing, when evaluating this feature for beta and graduation.
+Yes, enabling swap can affect performance of other critical daemons on the system.
+Any scenario where swap memory gets utilized is a result of system running out of physical RAM.
+Hence, to maintain the SLIs/SLOs of critical daemons on the node we highly recommend to disable the swap for the system.slice
+along with reserving adequate enough system reserved memory.
+
+The SLI that could potentially be impacted is [pod startup latency](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md).
+If the container runtime or kubelet are performing slower than expected, pod startup latency would be impacted.
+In addition to this SLI, general areas around pod lifecycle (image pulls, sandbox creation, storage) could become slow.
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
@@ -1058,7 +1184,10 @@ nodes that do not use swap memory.
 - **2017-10-06:** Discussed in [#53533](https://github.com/kubernetes/kubernetes/issues/53533).
 - **2021-01-05:** Initial design discussion document for swap support and use cases.
 - **2021-04-05:** Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400).
-- **2021-08-09:** New in Kubernetes v1.22: alpha support for using swap memory: https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/
+- **2021-08-09:** New in Kubernetes v1.22: alpha support for using swap memory: https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/.
+- **2023-04-17:** KEP update for beta1 [#3957](https://github.com/kubernetes/enhancements/pull/3957).
+- **2023-08-15:** Beta1 released in kubernetes 1.28
+- **2024-01-12:** Updates to Beta2 KEP.
 
 ## Drawbacks
 
@@ -1084,6 +1213,10 @@ This inconsistency makes it difficult or impossible to use swap in production,
 particularly if a user wants to restrict workloads from using swap when using
 the CRI rather than dockershim.
 
+This is also a breaking change.
+Users have used --fail-swap-on=false to allow for kubernetes to run
+on a swap enabled node.
+
 ### Restrict swap usage at the cgroup level
 
 Setting a swap limit at the cgroup level would allow us to restrict the usage