
Commit e55351e

harche iholder101
authored and committed
Graduate swap to Beta 1
Signed-off-by: Itamar Holder <[email protected]>
1 parent 7f4ad2f commit e55351e

2 files changed

+114
-15
lines changed

keps/sig-node/2400-node-swap/README.md

Lines changed: 110 additions & 13 deletions
@@ -8,6 +8,11 @@
88
- [Goals](#goals)
99
- [Non-Goals](#non-goals)
1010
- [Proposal](#proposal)
11+
- [Enable Swap Support only for Burstable QoS Pods](#enable-swap-support-only-for-burstable-qos-pods)
12+
- [Set Aside Swap for System Critical Daemon](#set-aside-swap-for-system-critical-daemon)
13+
- [Use Swap Throttling](#use-swap-throttling)
14+
- [Steps to Calculate Swap Limit](#steps-to-calculate-swap-limit)
15+
- [Example](#example)
1116
- [User Stories](#user-stories)
1217
- [Improved Node Stability](#improved-node-stability)
1318
- [Long-running applications that swap out startup memory](#long-running-applications-that-swap-out-startup-memory)
@@ -17,6 +22,7 @@
1722
- [Virtualization management overhead](#virtualization-management-overhead)
1823
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
1924
- [Risks and Mitigations](#risks-and-mitigations)
25+
- [Security risk](#security-risk)
2026
- [Design Details](#design-details)
2127
- [Enabling swap as an end user](#enabling-swap-as-an-end-user)
2228
- [API Changes](#api-changes)
@@ -30,7 +36,8 @@
3036
- [Graduation Criteria](#graduation-criteria)
3137
- [Alpha](#alpha)
3238
- [Alpha2](#alpha2)
33-
- [Beta](#beta)
39+
- [Beta 1](#beta-1)
40+
- [Beta 2](#beta-2)
3441
- [GA](#ga)
3542
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
3643
- [Version Skew Strategy](#version-skew-strategy)
@@ -166,6 +173,83 @@ administrators can configure the kubelet such that:
166173

167174
This proposal enables scenarios 1 and 2 above, but not 3.
168175

176+
### Enable Swap Support only for Burstable QoS Pods
177+
Before enabling swap support through the pod API, it is crucial to build confidence in this feature by carefully assessing its impact on workloads and Kubernetes. As an initial step, we propose enabling swap support for Burstable QoS Pods by automatically calculating the appropriate swap values, rather than allowing users to input these values manually.
178+
179+
Swap access is granted only to Burstable QoS pods. Guaranteed QoS pods are usually higher-priority pods, so we want to spare them swap's performance penalty. Best-Effort pods, by contrast, are low-priority pods that are the first to be killed under node pressure. They are also unpredictable, which makes it hard to assess how much swap memory would be reasonable to allocate to them.
180+
181+
By doing so, we can ensure a thorough understanding of the feature's performance and stability before considering the manual input of swap values in a subsequent beta release. This cautious approach will ensure the efficient allocation of resources and the smooth integration of swap support into Kubernetes.
182+
183+
Allocate the swap limit equal to the requested memory for each container and adjust the proportion of swap based on the total swap memory available.
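
As an illustration only, the eligibility rule can be sketched as a small classifier. This is a simplified approximation of the Kubernetes QoS rules that looks only at `cpu` and `memory`; the `Container` type, `qos_class`, and `swap_allowed` are hypothetical names for this sketch and are not kubelet APIs. Quantities are assumed to be plain integers (millicores and bytes).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Container:
    # Hypothetical container model: resource name -> integer quantity
    # (millicores for "cpu", bytes for "memory").
    requests: Dict[str, int] = field(default_factory=dict)
    limits: Dict[str, int] = field(default_factory=dict)


def qos_class(containers: List[Container]) -> str:
    """Simplified QoS classification over cpu and memory only."""
    any_set = False
    guaranteed = True
    for c in containers:
        for res in ("cpu", "memory"):
            limit = c.limits.get(res)
            # A missing request defaults to the limit.
            request = c.requests.get(res, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


def swap_allowed(containers: List[Container]) -> bool:
    # Under this proposal, only Burstable QoS pods get swap access.
    return qos_class(containers) == "Burstable"
```

For example, a pod whose only container sets a memory request but no limits is classified as Burstable here and would be eligible for swap.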
184+
185+
#### Set Aside Swap for System Critical Daemon
186+
187+
System critical daemons (such as Kubelet) are essential for node health. Usually, an appropriate portion of system resources (e.g., memory, CPU) is reserved as system reserved. However, swap doesn't inherently support reserving a portion out of the total available. For instance, in the case of memory, we set `memory.min` on the node-level cgroup to ensure an adequate amount of memory is set aside, away from the pods, and for system critical daemons. But there is no equivalent for swap; i.e., no `memory.swap.min` is supported in the kernel.
188+
189+
Since this proposal advocates enabling swap only for the Burstable QoS pods, this can be done by setting `memory.swap.max` on the cgroups used by the Burstable QoS pods. The value of this `memory.swap.max` can be calculated by:
190+
191+
memory.swap.max = total swap memory available on the system - system reserve (memory)
192+
193+
This is the total amount of swap available for all the Burstable QoS pods; let's call it `TotalPodsSwapAvailable`. This ensures that system critical daemons have access to an amount of swap at least equal to the system reserved memory, which indirectly acts as system-reserved support for swap.
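
A minimal sketch of this calculation, working in bytes. The function name and the clamp at zero (for nodes whose swap is smaller than the system reserved memory) are assumptions of this sketch, not behavior defined by the proposal.

```python
GiB = 1024 ** 3


def total_pods_swap_available(total_swap_bytes: int,
                              system_reserved_memory_bytes: int) -> int:
    """Value for memory.swap.max on the Burstable QoS cgroup."""
    # Assumption: never return a negative limit if swap is smaller
    # than the system reserved memory.
    return max(total_swap_bytes - system_reserved_memory_bytes, 0)


# 40 GiB of swap with 2 GiB system reserved leaves 38 GiB for pods.
burstable_swap_max = total_pods_swap_available(40 * GiB, 2 * GiB)
```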
194+
195+
#### Use Swap Throttling
196+
197+
`memory.swap.high` in cgroup v2 allows you to enable throttling for swap allocation. If a cgroup’s swap usage exceeds `memory.swap.high`, all its further allocations will be throttled.
198+
199+
Throttling will be enabled at the individual Burstable QoS pod level as well as at the Burstable QoS cgroup level.
200+
201+
1. At the Burstable QoS cgroup level, it can be calculated with: memory.swap.high = `TotalPodsSwapAvailable` * swap throttling factor
202+
2. At individual Burstable QoS pod level, it can be calculated with: memory.swap.high = memory.swap.max * swap throttling factor
203+
204+
The default value of the swap throttling factor is set to 0.9.
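
Both formulas reduce to the same multiplication, so a single helper can serve the QoS cgroup level and the pod level. The sketch below assumes byte values and truncates to an integer; the names `swap_high` and `SWAP_THROTTLING_FACTOR` are illustrative only.

```python
SWAP_THROTTLING_FACTOR = 0.9  # default factor stated in this proposal


def swap_high(swap_max_bytes: int,
              factor: float = SWAP_THROTTLING_FACTOR) -> int:
    """memory.swap.high derived from a swap limit and the throttling factor."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("throttling factor must be between 0 and 1")
    # Truncate to whole bytes (an assumption of this sketch).
    return int(swap_max_bytes * factor)


GiB = 1024 ** 3
# QoS cgroup level: input is TotalPodsSwapAvailable (38 GiB here).
burstable_swap_high = swap_high(38 * GiB)
# Pod level: input is that pod's own memory.swap.max (4 GiB here).
pod_swap_high = swap_high(4 * GiB)
```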
205+
### Steps to Calculate Swap Limit
206+
207+
1. **Calculate the total memory requests of the pod:**
208+
- Sum up the memory requests of all containers in the pod. Let's call this value `TotalMemory`.
209+
210+
2. **Determine the swap proportion for each container:**
211+
- For each container, divide its memory request by the `TotalMemory`. The result will be the proportion of requested memory for that container within the pod. Let's call this value `RequestedMemoryProportion`.
212+
213+
3. **Calculate the total swap memory available:**
214+
- `TotalPodsSwapAvailable` is the total amount of swap memory available for the pods. Divide it by the total physical memory. Let's call this value `SwapMemoryProportion`.
215+
216+
4. **Calculate the swap limit for each container:**
217+
- Multiply the `RequestedMemoryProportion` of a container by its memory request and then multiply the result by the `SwapMemoryProportion`. The result will be the adjusted swap limit for that specific container within the pod.
218+
219+
5. **Calculate the swap throttling threshold for each container:**
220+
- Multiply the swap limit of each container by the swap throttling factor (0.9).
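
The five steps above can be sketched as a single function. This is an illustrative sketch rather than kubelet code: the function name, the dictionary-based input, and the returned structure are all assumptions. Any consistent unit (bytes, GB) works for the inputs.

```python
from typing import Dict


def container_swap_limits(memory_requests: Dict[str, float],
                          total_pods_swap_available: float,
                          total_physical_memory: float,
                          throttling_factor: float = 0.9
                          ) -> Dict[str, Dict[str, float]]:
    """Per-container swap limit and throttling threshold."""
    # Step 1: total memory requests of the pod.
    total_memory = sum(memory_requests.values())
    if total_memory <= 0 or total_physical_memory <= 0:
        raise ValueError("memory totals must be positive")

    # Step 3: share of physical memory that is matched by pod swap.
    swap_memory_proportion = total_pods_swap_available / total_physical_memory

    result = {}
    for name, request in memory_requests.items():
        # Step 2: this container's share of the pod's memory requests.
        requested_memory_proportion = request / total_memory
        # Step 4: swap limit for this container.
        limit = requested_memory_proportion * request * swap_memory_proportion
        # Step 5: throttling threshold for this container.
        result[name] = {
            "swap_limit": limit,
            "swap_throttle": limit * throttling_factor,
        }
    return result
```

For instance, `container_swap_limits({"a": 3, "b": 1}, 8, 16)` gives container `a` a swap limit of 1.125 and container `b` a limit of 0.125, in the same unit as the inputs.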
221+
222+
223+
#### Example
224+
Suppose we have a Burstable QoS pod with two containers:
225+
226+
- Container A: Memory request 20 GB
227+
- Container B: Memory request 10 GB
228+
229+
Let's assume the total physical memory is 40 GB and the total swap memory available is also 40 GB. Also assume that the system reserved memory is configured at 2 GB.
230+
231+
First, calculate `TotalPodsSwapAvailable`, the total swap memory available to pods: 40 GB - 2 GB = 38 GB
231+
232+
Step 1: Calculate the total memory requests of the pod: 20 GB + 10 GB = 30 GB
234+
235+
Step 2: Determine the requested memory proportion for each container:
236+
- Container A: (20 GB) / (30 GB) = 2/3
237+
- Container B: (10 GB) / (30 GB) = 1/3
238+
239+
Step 3: Calculate the total swap memory available: With `TotalPodsSwapAvailable` at 38 GB and physical memory at 40 GB, the `SwapMemoryProportion` is 38 GB / 40 GB = 0.95
240+
241+
Step 4: Calculate the swap limit for each container:
242+
- Container A: (2/3) * 20 GB * 0.95 ≈ 12.66 GB
243+
- Container B: (1/3) * 10 GB * 0.95 ≈ 3.16 GB
244+
245+
Step 5: Calculate the swap throttling threshold for each container:
246+
- Container A: 12.66 * 0.9 = 11.39 GB
247+
- Container B: 3.16 * 0.9 = 2.844 GB
248+
249+
In this example, Container A would have a swap limit of 12.66 GB, and Container B would have a swap limit of 3.16 GB.
250+
251+
This approach allocates swap limits based on each container's memory request and adjusts the proportion based on the total swap memory available in the system. It ensures that each container gets a fair share of the swap space and helps maintain resource allocation efficiency.
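
The arithmetic in this example can be checked with a short script. The variable names are illustrative, and values are kept in GB as floating-point numbers. Note that the unrounded limits are roughly 12.667 GB and 3.167 GB; the figures in the text appear to be truncated to two decimals, so thresholds computed from the unrounded limits come out slightly higher (about 11.40 GB and 2.85 GB).

```python
# Inputs from the example, in GB.
requests_gb = {"A": 20.0, "B": 10.0}
physical_gb = 40.0
swap_gb = 40.0
system_reserved_gb = 2.0
throttling_factor = 0.9

# Swap left for Burstable pods after the system reserve: 38 GB.
pods_swap_gb = swap_gb - system_reserved_gb
# Total memory requests of the pod: 30 GB.
total_requests_gb = sum(requests_gb.values())
# SwapMemoryProportion: 38 / 40 = 0.95.
swap_memory_proportion = pods_swap_gb / physical_gb

limits = {
    name: (request / total_requests_gb) * request * swap_memory_proportion
    for name, request in requests_gb.items()
}
throttles = {name: limit * throttling_factor for name, limit in limits.items()}

for name in requests_gb:
    print(f"Container {name}: swap limit {limits[name]:.3f} GB, "
          f"throttle at {throttles[name]:.3f} GB")
```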
252+
169253
### User Stories
170254

171255
#### Improved Node Stability
@@ -300,6 +384,12 @@ and/or workloads in a number of different scenarios.
300384
Since swap provisioning is out of scope of this proposal, this enhancement
301385
poses low risk to Kubernetes clusters that will not enable swap.
302386

387+
#### Security risk
388+
389+
Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the potential security implications and can take appropriate steps to safeguard their system.
390+
391+
Additionally, end users may disable swap completely for a Pod in Beta 1 by making it a Guaranteed QoS Pod. Swap is then not enabled for the corresponding containers, so they carry no risk of information exposure through swap.
392+
303393
## Design Details
304394

305395
We summarize the implementation plan as follows:
@@ -487,14 +577,14 @@ Test grid tabs enabled:
487577

488578
No new e2e tests introduced.
489579

490-
For alpha2 [Current stage]:
580+
For alpha2:
491581

492582
- Add e2e tests that exercise all available swap configurations via the CRI.
493583
- Verify MemoryPressure behavior with swap enabled and document any changes
494584
for configuring eviction.
495585
- Verify new system-reserved settings for swap memory.
496586

497-
For beta [Future]:
587+
For beta 1:
498588

499589
- Add e2e tests that verify pod-level control of swap utilization.
500590
- Add e2e tests that verify swap performance with pods using a tmpfs.
@@ -536,20 +626,25 @@ Here are specific improvements to be made:
536626
swap limit for workloads.
537627
- Investigate eviction behavior with swap enabled.
538628

539-
540-
#### Beta
541-
542-
- Add support for controlling swap consumption at the pod level [via cgroups].
543-
- Handle usage of swap during container restart boundaries for writes to tmpfs
544-
(which may require pod cgroup change beyond what container runtime will do at
545-
container cgroup boundary).
629+
#### Beta 1
630+
- Enable Swap Support using Burstable QoS Pods only.
631+
- Enable Swap Support for Cgroup v2 Only.
632+
- Enable Swap throttling support
546633
- Add swap memory to the Kubelet stats api.
547634
- Determine a set of metrics for node QoS in order to evaluate the performance
548635
of nodes with and without swap enabled.
549-
- Better understand relationship of swap with memory QoS in cgroup v2
550-
(particularly `memory.high` usage).
551-
- Collect feedback from test user cases.
636+
- Make sure node e2e jobs that use swap are healthy
552637
- Improve coverage for appropriate scenarios in testgrid.
638+
- Remove support for setting an unlimited amount of swap (including the [swapBehavior](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-MemorySwapConfiguration) flag of the Kubelet), because without a limit on swap usage any workload with aggressive memory allocation can bring down a node.
639+
640+
#### Beta 2
641+
- Add support for controlling swap consumption at the container level [via cgroups] in Pod API in [container resources](https://github.com/kubernetes/kubernetes/blob/94a15929cf13354fdf3747cb266d511154f8c97b/staging/src/k8s.io/api/core/v1/types.go#L2443). More specifically add a new [ResourceName](https://github.com/kubernetes/kubernetes/blob/94a15929cf13354fdf3747cb266d511154f8c97b/staging/src/k8s.io/api/core/v1/types.go#L5522) `swap`. This will make sure we stay consistent with other resources like `cpu`, `memory` etc.
642+
- Publish a Kubernetes doc page encouraging users to use encrypted swap if they wish to enable this feature.
643+
- Handle usage of swap during container restart boundaries for writes to tmpfs
644+
(which may require pod cgroup change beyond what container runtime will do at
645+
container cgroup boundary).
646+
- Deprecate the Burstable-QoS-only swap support introduced in Beta 1.
647+
553648

554649
[via cgroups]: #restrict-swap-usage-at-the-cgroup-level
555650

@@ -559,6 +654,8 @@ _(Tentative.)_
559654

560655
- Test a wide variety of scenarios that may be affected by swap support.
561656
- Remove feature flag.
657+
- Remove the Burstable-QoS-only swap support deprecated in Beta 2.
658+
562659

563660
### Upgrade / Downgrade Strategy
564661

keps/sig-node/2400-node-swap/kep.yaml

Lines changed: 4 additions & 2 deletions
@@ -4,6 +4,8 @@ authors:
44
- "@ehashman"
55
- "@ike-ma"
66
- "@SergeyKanzhelev"
7+
- "@harche"
8+
- "@iholder101"
79
owning-sig: sig-node
810
participating-sigs:
911
- sig-node
@@ -18,12 +20,12 @@ approvers:
1820
- "@dchen1107"
1921

2022
# The target maturity stage in the current dev cycle for this KEP.
21-
stage: alpha
23+
stage: beta
2224

2325
# The most recent milestone for which work toward delivery of this KEP has been
2426
# done. This can be the current (upcoming) milestone, if it is being actively
2527
# worked on.
26-
latest-milestone: "v1.27"
28+
latest-milestone: "v1.28"
2729

2830
# The milestone at which this feature was, or is targeted to be, at each stage.
2931
milestone:
