From 068b374180814203db3b1e9fe05299bf30582d7b Mon Sep 17 00:00:00 2001 From: Dixita Narang Date: Fri, 31 Mar 2023 19:26:35 +0000 Subject: [PATCH 1/2] Copying blog post content from memory QOS alphav1 --- .../container-memory-high.svg | 2 + .../container-memory-min.svg | 87 +++++++++++++ .../2023-03-08-memory-qos-cgroups-v2/index.md | 118 ++++++++++++++++++ .../memory-qos-cal.svg | 1 + .../node-memory-min.svg | 98 +++++++++++++++ .../pod-memory-min.svg | 97 ++++++++++++++ 6 files changed, 403 insertions(+) create mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-high.svg create mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-min.svg create mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/index.md create mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/memory-qos-cal.svg create mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/node-memory-min.svg create mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/pod-memory-min.svg diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-high.svg b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-high.svg new file mode 100644 index 0000000000000..a11906c96b32f --- /dev/null +++ b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-high.svg @@ -0,0 +1,2 @@ + + \ No newline at end of file diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-min.svg b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-min.svg new file mode 100644 index 0000000000000..f9711a641c521 --- /dev/null +++ b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-min.svg @@ -0,0 +1,87 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/index.md b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/index.md new file mode 100644 index 0000000000000..cc483270f3e1a --- /dev/null +++ b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/index.md @@ -0,0 +1,118 @@ +--- +layout: blog +title: 'Quality-of-Service for Memory Resources' +date: 2021-11-26 +slug: qos-memory-resources +--- + +**Authors:** Tim Xu (Tencent Cloud) + +Kubernetes v1.22, released in August 2021, introduced a new alpha feature that improves how Linux nodes implement memory resource requests and limits. + +In prior releases, Kubernetes did not support memory quality guarantees. +For example, if you set container resources as follows: +``` +apiVersion: v1 +kind: Pod +metadata: + name: example +spec: + containers: + - name: nginx + resources: + requests: + memory: "64Mi" + cpu: "250m" + limits: + memory: "64Mi" + cpu: "500m" +``` +`spec.containers[].resources.requests`(e.g. cpu, memory) is designed for scheduling. When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. + +`spec.containers[].resources.limits` is passed to the container runtime when the kubelet starts a container. 
CPU is considered a "compressible" resource. If your app starts hitting your CPU limits, Kubernetes starts throttling your container, giving your app potentially worse performance. However, it won't be terminated. That is what "compressible" means.

In cgroup v1, and prior to this feature, the container runtime never took `spec.containers[].resources.requests["memory"]` into account; the memory request was effectively ignored. This is unlike CPU, for which the container runtime considers both requests and limits. Furthermore, memory actually can't be compressed in cgroup v1. Because there is no way to throttle memory usage, if a container goes past its memory limit, it will be terminated by the kernel with an OOM (Out of Memory) kill.

Fortunately, cgroup v2 brings a new design and implementation to achieve full protection on memory. The new feature relies on cgroups v2, which most current operating system releases for Linux already provide. With this experimental feature, [quality-of-service for pods and containers](/docs/tasks/configure-pod-container/quality-service-pod/) extends to cover not just CPU time but memory as well.

## How does it work?

Memory QoS uses the memory controller of cgroup v2 to guarantee memory resources in Kubernetes. Memory requests and limits of containers in a pod are used to set the interfaces `memory.min` and `memory.high` provided by the memory controller. When `memory.min` is set to the memory request, memory resources are reserved and never reclaimed by the kernel; this is how Memory QoS ensures the availability of memory for Kubernetes pods. If memory limits are set for a container, the system needs to limit the container's memory usage; Memory QoS uses `memory.high` to throttle workloads approaching their memory limit, ensuring that the system is not overwhelmed by instantaneous memory allocation.

![](./memory-qos-cal.svg)

The following table details the specific functions of these two parameters and how they correspond to Kubernetes container resources.

| File | Description |
| ---- | ----------- |
| `memory.min` | `memory.min` specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup's memory usage reaches this low limit and can't be increased, the system OOM killer will be invoked. We map it to the container's memory request. |
| `memory.high` | `memory.high` is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. The default is `max`, meaning there is no limit. We use a formula to calculate `memory.high`, depending on the container's memory limit or node allocatable memory (if the container's memory limit is empty) and a throttling factor. Please refer to the KEP for more details on the formula. |

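To make the mapping concrete, here is a minimal Go sketch (illustrative only, not kubelet source code) that derives both values for a single container. The request and limit numbers are hypothetical, and the 0.8 multiplier is the default throttling factor described below:

```go
package main

import "fmt"

const mi = 1024 * 1024 // bytes in one MiB

func main() {
	memoryRequest := int64(64 * mi) // requests.memory (hypothetical)
	memoryLimit := int64(128 * mi)  // limits.memory (hypothetical)
	throttlingFactor := 0.8         // alpha default

	// memory.min is simply the memory request, reserving that much
	// memory against kernel reclaim.
	fmt.Println("memory.min:", memoryRequest)

	// memory.high is the limit scaled by the throttling factor, and is
	// only set when it exceeds the request (otherwise throttling would
	// start below the reserved minimum).
	if high := int64(float64(memoryLimit) * throttlingFactor); high > memoryRequest {
		fmt.Println("memory.high:", high)
	}
}
```

With these numbers, `memory.min` is 64Mi and `memory.high` is roughly 102Mi (0.8 × 128Mi).
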
When container memory requests are made, kubelet passes `memory.min` to the back-end CRI runtime (such as containerd or CRI-O) via the `Unified` field in CRI during container creation. The `memory.min` in the container-level cgroup will be set to:

![](./container-memory-min.svg)
i: the ith container in one pod

Since the `memory.min` interface requires that the ancestor cgroup directories are all set, the pod and node cgroup directories need to be set correctly.

`memory.min` in the pod-level cgroup:
![](./pod-memory-min.svg)
i: the ith container in one pod

`memory.min` in the node-level cgroup:
![](./node-memory-min.svg)
i: the ith pod in one node, j: the jth container in one pod

Kubelet will manage the cgroup hierarchy of the pod-level and node-level cgroups directly using the runc libcontainer library, while container cgroup limits are managed by the container runtime.

For memory limits, in addition to the original way of limiting memory usage, Memory QoS adds a feature for throttling memory allocation. A throttling factor is introduced as a multiplier (the default is 0.8). If the result of multiplying the memory limit by the factor is greater than the memory request, kubelet will set `memory.high` to that value, again passing it via the `Unified` field in CRI. If the container does not specify a memory limit, kubelet will use node allocatable memory instead. The `memory.high` in the container-level cgroup is set to:

![](./container-memory-high.svg)
i: the ith container in one pod

This can help improve stability when pod memory usage increases, ensuring that memory is throttled as it approaches the memory limit.

## How do I use it?

Here are the prerequisites for enabling Memory QoS on your Linux node; some of these are related to [Kubernetes support for cgroup v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2).

1. Kubernetes since v1.22
2. [runc](https://github.com/opencontainers/runc) since v1.0.0-rc93; [containerd](https://containerd.io/) since 1.4; [cri-o](https://cri-o.io/) since 1.20
3. Linux kernel minimum version: 4.15, recommended version: 5.2+
4. A Linux image with cgroup v2 enabled, or cgroup v2 enabled manually via the `unified_cgroup_hierarchy` boot option

OCI runtimes such as runc and crun already support cgroups v2 [`Unified`](https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#unified), and Kubernetes CRI has also made the desired changes to support passing [`Unified`](https://github.com/kubernetes/kubernetes/pull/102578). However, CRI runtime support is required as well. Memory QoS in the alpha phase is designed to support containerd and CRI-O. The related PR [Feature: containerd-cri support LinuxContainerResources.Unified #5627](https://github.com/containerd/containerd/pull/5627) has been merged and will be released in containerd 1.6. CRI-O's [implement kube alpha features for 1.22 #5207](https://github.com/cri-o/cri-o/pull/5207) is still a work in progress.

With those prerequisites met, you can enable the `MemoryQoS` feature gate (see [Set kubelet parameters via a config file](/docs/tasks/administer-cluster/kubelet-config-file/)).

## How can I learn more?

You can find more details as follows:
- [Support Memory QoS with cgroup v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos/#readme)
- [cgroup v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2/#readme)

## How do I get involved?
+You can reach SIG Node by several means: +- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node) +- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node) +- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode) + +You can also contact me directly: +- GitHub / Slack: @xiaoxubeii +- Email: xiaoxubeii@gmail.com diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/memory-qos-cal.svg b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/memory-qos-cal.svg new file mode 100644 index 0000000000000..545ee77b2997b --- /dev/null +++ b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/memory-qos-cal.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/node-memory-min.svg b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/node-memory-min.svg new file mode 100644 index 0000000000000..7c03aafc2817d --- /dev/null +++ b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/node-memory-min.svg @@ -0,0 +1,98 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/pod-memory-min.svg b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/pod-memory-min.svg new file mode 100644 index 0000000000000..84d3a940a07bc --- /dev/null +++ b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/pod-memory-min.svg @@ -0,0 +1,97 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file From be1de2e07f12710f87573938909d173ce0644810 Mon Sep 17 00:00:00 2001 From: Dixita Narang Date: Sun, 2 Apr 2023 19:14:14 +0000 Subject: [PATCH 2/2] Adding alhpav2 changes to memory QoS blog post to republish the new details Signed-off-by: Dixita Narang --- .../container-memory-high.svg | 2 - .../2023-03-08-memory-qos-cgroups-v2/index.md | 118 ------- .../memory-qos-cal.svg | 1 - .../pod-memory-min.svg | 97 ------ .../container-memory-high-best-effort.svg | 87 +++++ .../container-memory-high-limit.svg | 226 +++++++++++++ .../container-memory-high-no-limits.svg | 203 ++++++++++++ .../container-memory-high.svg | 252 +++++++++++++++ .../container-memory-max.svg} | 174 +++++----- .../container-memory-min.svg | 0 .../2023-05-05-memory-qos-cgroups-v2/index.md | 298 ++++++++++++++++++ .../memory-qos-cal.svg | 1 + 12 files changed, 1148 insertions(+), 311 deletions(-) delete mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-high.svg delete mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/index.md delete mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/memory-qos-cal.svg delete mode 100644 content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/pod-memory-min.svg create mode 100644 content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-best-effort.svg create mode 100644 content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-limit.svg create mode 100644 content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-no-limits.svg create mode 100644 content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high.svg rename 
content/en/blog/_posts/{2023-03-08-memory-qos-cgroups-v2/node-memory-min.svg => 2023-05-05-memory-qos-cgroups-v2/container-memory-max.svg} (62%) rename content/en/blog/_posts/{2023-03-08-memory-qos-cgroups-v2 => 2023-05-05-memory-qos-cgroups-v2}/container-memory-min.svg (100%) create mode 100644 content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/index.md create mode 100644 content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/memory-qos-cal.svg diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-high.svg b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-high.svg deleted file mode 100644 index a11906c96b32f..0000000000000 --- a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-high.svg +++ /dev/null @@ -1,2 +0,0 @@ - - \ No newline at end of file diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/index.md b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/index.md deleted file mode 100644 index cc483270f3e1a..0000000000000 --- a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/index.md +++ /dev/null @@ -1,118 +0,0 @@ ---- -layout: blog -title: 'Quality-of-Service for Memory Resources' -date: 2021-11-26 -slug: qos-memory-resources ---- - -**Authors:** Tim Xu (Tencent Cloud) - -Kubernetes v1.22, released in August 2021, introduced a new alpha feature that improves how Linux nodes implement memory resource requests and limits. - -In prior releases, Kubernetes did not support memory quality guarantees. -For example, if you set container resources as follows: -``` -apiVersion: v1 -kind: Pod -metadata: - name: example -spec: - containers: - - name: nginx - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "64Mi" - cpu: "500m" -``` -`spec.containers[].resources.requests`(e.g. cpu, memory) is designed for scheduling. When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. - -`spec.containers[].resources.limits` is passed to the container runtime when the kubelet starts a container. CPU is considered a "compressible" resource. If your app starts hitting your CPU limits, Kubernetes starts throttling your container, giving your app potentially worse performance. However, it won’t be terminated. That is what "compressible" means. - -In cgroup v1, and prior to this feature, the container runtime never took into account and effectively ignored spec.containers[].resources.requests["memory"]. This is unlike CPU, in which the container runtime consider both requests and limits. Furthermore, memory actually can't be compressed in cgroup v1. Because there is no way to throttle memory usage, if a container goes past its memory limit it will be terminated by the kernel with an OOM (Out of Memory) kill. - -Fortunately, cgroup v2 brings a new design and implementation to achieve full protection on memory. The new feature relies on cgroups v2 which most current operating system releases for Linux already provide. With this experimental feature, [quality-of-service for pods and containers](/docs/tasks/configure-pod-container/quality-service-pod/) extends to cover not just CPU time but memory as well. - -## How does it work? 
Memory QoS uses the memory controller of cgroup v2 to guarantee memory resources in Kubernetes. Memory requests and limits of containers in a pod are used to set the interfaces `memory.min` and `memory.high` provided by the memory controller. When `memory.min` is set to the memory request, memory resources are reserved and never reclaimed by the kernel; this is how Memory QoS ensures the availability of memory for Kubernetes pods. If memory limits are set for a container, the system needs to limit the container's memory usage; Memory QoS uses `memory.high` to throttle workloads approaching their memory limit, ensuring that the system is not overwhelmed by instantaneous memory allocation.

![](./memory-qos-cal.svg)

The following table details the specific functions of these two parameters and how they correspond to Kubernetes container resources.

| File | Description |
| ---- | ----------- |
| `memory.min` | `memory.min` specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup's memory usage reaches this low limit and can't be increased, the system OOM killer will be invoked. We map it to the container's memory request. |
| `memory.high` | `memory.high` is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. The default is `max`, meaning there is no limit. We use a formula to calculate `memory.high`, depending on the container's memory limit or node allocatable memory (if the container's memory limit is empty) and a throttling factor. Please refer to the KEP for more details on the formula. |

When container memory requests are made, kubelet passes `memory.min` to the back-end CRI runtime (such as containerd or CRI-O) via the `Unified` field in CRI during container creation. The `memory.min` in the container-level cgroup will be set to:

![](./container-memory-min.svg)
i: the ith container in one pod

Since the `memory.min` interface requires that the ancestor cgroup directories are all set, the pod and node cgroup directories need to be set correctly.

`memory.min` in the pod-level cgroup:
![](./pod-memory-min.svg)
i: the ith container in one pod

`memory.min` in the node-level cgroup:
![](./node-memory-min.svg)
i: the ith pod in one node, j: the jth container in one pod

Kubelet will manage the cgroup hierarchy of the pod-level and node-level cgroups directly using the runc libcontainer library, while container cgroup limits are managed by the container runtime.

For memory limits, in addition to the original way of limiting memory usage, Memory QoS adds a feature for throttling memory allocation. A throttling factor is introduced as a multiplier (the default is 0.8). If the result of multiplying the memory limit by the factor is greater than the memory request, kubelet will set `memory.high` to that value, again passing it via the `Unified` field in CRI. If the container does not specify a memory limit, kubelet will use node allocatable memory instead. The `memory.high` in the container-level cgroup is set to:

![](./container-memory-high.svg)
i: the ith container in one pod

This can help improve stability when pod memory usage increases, ensuring that memory is throttled as it approaches the memory limit.

## How do I use it?

Here are the prerequisites for enabling Memory QoS on your Linux node; some of these are related to [Kubernetes support for cgroup v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2).

1. Kubernetes since v1.22
2. [runc](https://github.com/opencontainers/runc) since v1.0.0-rc93; [containerd](https://containerd.io/) since 1.4; [cri-o](https://cri-o.io/) since 1.20
3. Linux kernel minimum version: 4.15, recommended version: 5.2+
4. A Linux image with cgroup v2 enabled, or cgroup v2 enabled manually via the `unified_cgroup_hierarchy` boot option

OCI runtimes such as runc and crun already support cgroups v2 [`Unified`](https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#unified), and Kubernetes CRI has also made the desired changes to support passing [`Unified`](https://github.com/kubernetes/kubernetes/pull/102578). However, CRI runtime support is required as well. Memory QoS in the alpha phase is designed to support containerd and CRI-O. The related PR [Feature: containerd-cri support LinuxContainerResources.Unified #5627](https://github.com/containerd/containerd/pull/5627) has been merged and will be released in containerd 1.6. CRI-O's [implement kube alpha features for 1.22 #5207](https://github.com/cri-o/cri-o/pull/5207) is still a work in progress.

With those prerequisites met, you can enable the `MemoryQoS` feature gate (see [Set kubelet parameters via a config file](/docs/tasks/administer-cluster/kubelet-config-file/)).

## How can I learn more?

You can find more details as follows:
- [Support Memory QoS with cgroup v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos/#readme)
- [cgroup v2](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2/#readme)

## How do I get involved?
-You can reach SIG Node by several means: -- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node) -- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node) -- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode) - -You can also contact me directly: -- GitHub / Slack: @xiaoxubeii -- Email: xiaoxubeii@gmail.com diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/memory-qos-cal.svg b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/memory-qos-cal.svg deleted file mode 100644 index 545ee77b2997b..0000000000000 --- a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/memory-qos-cal.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/pod-memory-min.svg b/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/pod-memory-min.svg deleted file mode 100644 index 84d3a940a07bc..0000000000000 --- a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/pod-memory-min.svg +++ /dev/null @@ -1,97 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-best-effort.svg b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-best-effort.svg new file mode 100644 index 0000000000000..cf9283885855e --- /dev/null +++ b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-best-effort.svg @@ -0,0 +1,87 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-limit.svg b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-limit.svg new file mode 100644 index 0000000000000..3a545f20dd85f --- /dev/null +++ b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-limit.svg @@ -0,0 +1,226 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-no-limits.svg b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-no-limits.svg new file mode 100644 index 0000000000000..845f5d0d07bb2 --- /dev/null +++ b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high-no-limits.svg @@ -0,0 +1,203 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at 
end of file diff --git a/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high.svg b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high.svg new file mode 100644 index 0000000000000..02357ef901582 --- /dev/null +++ b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-high.svg @@ -0,0 +1,252 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/node-memory-min.svg b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-max.svg similarity index 62% rename from content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/node-memory-min.svg rename to content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-max.svg index 7c03aafc2817d..5d4602069b957 100644 --- a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/node-memory-min.svg +++ b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-max.svg @@ -1,98 +1,86 @@ - - + + - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-min.svg b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-min.svg similarity index 100% rename from content/en/blog/_posts/2023-03-08-memory-qos-cgroups-v2/container-memory-min.svg rename to content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/container-memory-min.svg diff --git a/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/index.md b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/index.md new file mode 100644 index 0000000000000..01e7b957775a7 --- /dev/null +++ b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/index.md @@ -0,0 +1,298 @@ +--- +layout: blog +title: 'Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha)' +date: 2023-05-05 +slug: qos-memory-resources +--- + +**Authors:** Dixita Narang (Google) + +Kubernetes v1.27, released in April 2023, introduced changes to +Memory QoS (alpha) to improve memory management capabilites in Linux nodes. + +Support for Memory QoS was initially added in Kubernetes v1.22, and later some +[limitations](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos#reasons-for-changing-the-formula-of-memoryhigh-calculation-in-alpha-v127) +around the formula for calculating `memory.high` were identified. These limitations are +addressed in Kubernetes v1.27. + +## Background + +Kubernetes allows you to optionally specify how much of each resources a container needs +in the Pod specification. The most common resources to specify are CPU and Memory. 
For example, a Pod manifest that defines container resource requirements could look like:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"
```

* `spec.containers[].resources.requests`

  When you specify the resource request for containers in a Pod, the
  [Kubernetes scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler)
  uses this information to decide which node to place the Pod on. The scheduler
  ensures that for each resource type, the sum of the resource requests of the
  scheduled containers is less than the total allocatable resources on the node.

* `spec.containers[].resources.limits`

  When you specify the resource limit for containers in a Pod, the kubelet enforces
  those limits so that the running containers are not allowed to use more of those
  resources than the limits you set.

When the kubelet starts a container as part of a Pod, the kubelet passes the
container's requests and limits for CPU and memory to the container runtime.
The container runtime assigns both the CPU request and the CPU limit to a container.
Provided the system has free CPU time, containers are guaranteed to be
allocated as much CPU as they request. A container cannot use more CPU than
the configured limit; its CPU usage will be throttled if it uses more CPU
than the specified limit within a given time slice.

Prior to the Memory QoS feature, the container runtime only used the memory
limit and discarded the memory `request` (requests were, and still are,
also used to influence [scheduling](/docs/concepts/scheduling-eviction/#scheduling)).
If a container uses more memory than the configured limit,
the Linux Out Of Memory (OOM) killer will be invoked.

Let's compare how the container runtime on Linux typically configures the memory
request and limit in cgroups, with and without the Memory QoS feature:

* **Memory request**

  The memory request is mainly used by kube-scheduler during (Kubernetes) Pod
  scheduling. In cgroups v1, there are no controls to specify the minimum amount
  of memory the cgroups must always retain. Hence, the container runtime did not
  use the value of the requested memory set in the Pod spec.

  cgroups v2 introduced a `memory.min` setting, used to specify the minimum
  amount of memory that should remain available to the processes within
  a given cgroup. If the memory usage of a cgroup is within its effective
  min boundary, the cgroup's memory won't be reclaimed under any conditions.
  If the kernel cannot maintain at least `memory.min` bytes of memory for the
  processes within the cgroup, the kernel invokes its OOM killer. In other words,
  the kernel guarantees at least this much memory is available or terminates
  processes (which may be outside the cgroup) in order to make memory more available.
  Memory QoS maps `memory.min` to `spec.containers[].resources.requests.memory`
  to ensure the availability of memory for containers in Kubernetes Pods.

* **Memory limit**

  The memory limit specifies an amount of memory beyond which, if the container
  tries to allocate more, the Linux kernel will terminate a process with an
  OOM (Out of Memory) kill. If the terminated process was the main (or only)
  process inside the container, the container may exit.

  In cgroups v1, the `memory.limit_in_bytes` interface is used to set the memory usage limit.

  However, unlike CPU, it was not possible to apply memory throttling: as soon as a
  container crossed the memory limit, it would be OOM killed.

  In cgroups v2, `memory.max` is analogous to `memory.limit_in_bytes` in cgroups v1.
  Memory QoS maps `memory.max` to `spec.containers[].resources.limits.memory` to
  specify the hard limit for memory usage. If the memory consumption goes above this
  level, the kernel invokes its OOM killer.

  cgroups v2 also added the `memory.high` configuration. Memory QoS uses `memory.high`
  to set the memory usage throttle limit. If the `memory.high` limit is breached,
  the offending cgroups are throttled, and the kernel tries to reclaim memory,
  which may avoid an OOM kill.

## How it works

### Cgroups v2 memory controller interfaces & Kubernetes container resources mapping

Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in
Kubernetes. The cgroups v2 interfaces that this feature uses are:
* `memory.max`
* `memory.min`
* `memory.high`

{{< figure src="/blog/2023/05/05/qos-memory-resources/memory-qos-cal.svg" title="Memory QoS Levels" alt="Memory QoS Levels" >}}

`memory.max` is mapped to `limits.memory` specified in the Pod spec. The kubelet and
the container runtime configure the limit in the respective cgroup. The kernel
enforces the limit to prevent the container from using more than the configured
resource limit. If a process in a container tries to consume more than the
specified limit, the kernel terminates the process(es) with an Out of Memory (OOM) error.

{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-max.svg" title="memory.max maps to limits.memory" alt="memory.max maps to limits.memory" >}}

`memory.min` is mapped to `requests.memory`, which results in the reservation of memory
resources that should never be reclaimed by the kernel. This is how Memory QoS ensures
the availability of memory for Kubernetes pods. If there's no unprotected reclaimable
memory available, the OOM killer is invoked to make more memory available.

{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-min.svg" title="memory.min maps to requests.memory" alt="memory.min maps to requests.memory" >}}

For memory protection, in addition to the original way of limiting memory usage, Memory QoS
throttles workloads approaching their memory limit, ensuring that the system is not overwhelmed
by sporadic increases in memory usage. A new field, `memoryThrottlingFactor`, is available in
the KubeletConfiguration when you enable the MemoryQoS feature. It is set to 0.9 by default.
`memory.high` is mapped to a throttling limit calculated by using `memoryThrottlingFactor`,
`requests.memory` and `limits.memory` as in the formula below, and rounding down the
value to the nearest page size:

{{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high.svg" title="memory.high formula" alt="memory.high formula" >}}

**Note**: If a container has no memory limits specified, `limits.memory` is substituted for node allocatable memory.

**Summary:**

| File | Description |
| ---- | ----------- |
| `memory.max` | `memory.max` specifies the maximum memory limit a container is allowed to use. If a process within the container tries to consume more memory than the configured limit, the kernel terminates the process with an Out of Memory (OOM) error. It is mapped to the container's memory limit specified in the Pod manifest. |
| `memory.min` | `memory.min` specifies a minimum amount of memory the cgroups must always retain, i.e., memory that should never be reclaimed by the system. If there's no unprotected reclaimable memory available, an OOM kill is invoked. It is mapped to the container's memory request specified in the Pod manifest. |
| `memory.high` | `memory.high` specifies the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. Kubernetes uses a formula to calculate `memory.high`, depending on the container's memory request, its memory limit or node allocatable memory (if the container's memory limit is empty), and a throttling factor. Please refer to the KEP for more details on the formula. |

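As a concrete sketch (illustrative only, not kubelet source code), the following Go program derives all three values for one container. It assumes the KEP's formula for `memory.high`: the request plus `memoryThrottlingFactor` times the distance between the limit and the request, rounded down to the nearest page size, with node allocatable memory standing in when no limit is set. All numbers are hypothetical:

```go
package main

import "fmt"

const (
	mi       = 1024 * 1024 // bytes in one MiB
	pageSize = 4096        // typical x86-64 page size
)

// memoryHigh computes the throttle limit as described above; limits == 0
// means "no limit set", in which case node allocatable memory is used.
func memoryHigh(requests, limits, nodeAllocatable int64, factor float64) int64 {
	if limits == 0 {
		limits = nodeAllocatable
	}
	high := float64(requests) + factor*float64(limits-requests)
	return int64(high) / pageSize * pageSize // round down to a page boundary
}

func main() {
	requests := int64(64 * mi)           // requests.memory (hypothetical)
	limits := int64(128 * mi)            // limits.memory (hypothetical)
	allocatable := int64(16 * 1024 * mi) // node allocatable memory (hypothetical)

	fmt.Println("memory.min: ", requests) // maps to requests.memory
	fmt.Println("memory.max: ", limits)   // maps to limits.memory
	fmt.Println("memory.high:", memoryHigh(requests, limits, allocatable, 0.9))
}
```
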
**Note**: `memory.high` is set only on container-level cgroups, while `memory.min` is set on
container-, pod-, and node-level cgroups.

### `memory.min` calculations for the cgroup hierarchy

When container memory requests are made, the kubelet passes `memory.min` to the back-end
CRI runtime (such as containerd or CRI-O) via the `Unified` field in CRI during
container creation. The `memory.min` in container-level cgroups will be set to:

$memory.min = pod.spec.containers[i].resources.requests[memory]$
for every ith container in a pod

Since the `memory.min` interface requires that the ancestor cgroup directories are all
set, the pod and node cgroup directories need to be set correctly.

`memory.min` in the pod-level cgroup:
$memory.min = \sum_{i=0}^{\text{no. of containers}}pod.spec.containers[i].resources.requests[memory]$
for every ith container in a pod

`memory.min` in the node-level cgroup:
$memory.min = \sum_{i}^{\text{no. of pods}}\sum_{j}^{\text{no. of containers}}pod[i].spec.containers[j].resources.requests[memory]$
for every jth container in every ith pod on a node

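The roll-up these formulas describe can be sketched in a few lines of Go (the types
below are simplified stand-ins, not Kubernetes API structs):

```go
package main

import "fmt"

type Container struct {
	MemoryRequest int64 // requests.memory in bytes
}

type Pod struct {
	Containers []Container
}

// podMemoryMin sums the memory requests of all containers in one pod,
// giving the pod-level cgroup's memory.min.
func podMemoryMin(p Pod) int64 {
	var sum int64
	for _, c := range p.Containers {
		sum += c.MemoryRequest
	}
	return sum
}

// nodeMemoryMin sums pod-level memory.min over all pods on the node,
// giving the node-level cgroup's memory.min.
func nodeMemoryMin(pods []Pod) int64 {
	var sum int64
	for _, p := range pods {
		sum += podMemoryMin(p)
	}
	return sum
}

func main() {
	const mi = 1024 * 1024
	pods := []Pod{
		{Containers: []Container{{MemoryRequest: 64 * mi}, {MemoryRequest: 32 * mi}}},
		{Containers: []Container{{MemoryRequest: 128 * mi}}},
	}
	fmt.Println("pod[0] memory.min:", podMemoryMin(pods[0])) // 96Mi
	fmt.Println("node memory.min: ", nodeMemoryMin(pods))    // 224Mi
}
```
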
Kubelet will manage the cgroup hierarchy of the pod-level and node-level cgroups
directly using the libcontainer library (from the runc project), while container
cgroup limits are managed by the container runtime.

### Support for Pod QoS classes

Based on user feedback for the alpha feature in Kubernetes v1.22, some users would like
to opt out of Memory QoS on a per-pod basis to ensure there is no early memory throttling.
Therefore, in Kubernetes v1.27, Memory QoS also supports setting `memory.high` according to
the Quality of Service (QoS) class of the Pod. The cases for `memory.high` per QoS class
are as follows:

1. **Guaranteed pods** by their QoS definition require memory requests to equal memory limits
and are not overcommitted. Hence, the Memory QoS feature is disabled for those pods by not
setting `memory.high`. This ensures that Guaranteed pods can fully use their memory requests
up to their set limit without hitting any throttling.

2. **Burstable pods** by their QoS definition require at least one container in the Pod with
a CPU or memory request or limit set.

   * When requests.memory and limits.memory are set, the formula is used as-is:

     {{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-limit.svg" title="memory.high when requests and limits are set" alt="memory.high when requests and limits are set" >}}

   * When requests.memory is set and limits.memory is not set, limits.memory is substituted
   for node allocatable memory in the formula:

     {{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-no-limits.svg" title="memory.high when limits are not set" alt="memory.high when limits are not set" >}}

3. **BestEffort pods** by their QoS definition do not set any memory or CPU limits or requests.
   For this case, Kubernetes sets requests.memory = 0 and substitutes limits.memory for node
   allocatable memory in the formula:

   {{< figure src="/blog/2023/05/05/qos-memory-resources/container-memory-high-best-effort.svg" title="memory.high for BestEffort Pod" alt="memory.high for BestEffort Pod" >}}

**Summary**: Only Pods in the Burstable and BestEffort QoS classes set `memory.high`.
Guaranteed QoS pods do not set `memory.high`, as their memory is guaranteed.

## How do I use it?

The prerequisites for enabling the Memory QoS feature on your Linux node are:

1. Verify the [requirements](/docs/concepts/architecture/cgroups/#requirements)
   related to [Kubernetes support for cgroups v2](/docs/concepts/architecture/cgroups)
   are met.
2. Ensure the CRI runtime supports Memory QoS. At the time of writing, only containerd
   and CRI-O provide support compatible with Memory QoS (alpha). This was implemented
   in the following PRs:
   * containerd: [Feature: containerd-cri support LinuxContainerResources.Unified #5627](https://github.com/containerd/containerd/pull/5627).
   * CRI-O: [implement kube alpha features for 1.22 #5207](https://github.com/cri-o/cri-o/pull/5207).

Memory QoS remains an alpha feature for Kubernetes v1.27. You can enable the feature by setting
`MemoryQoS=true` in the kubelet configuration file:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
```

## How do I get involved?

Huge thank you to all the contributors who helped with the design, implementation,
and review of this feature:

* Dixita Narang ([ndixita](https://github.com/ndixita))
* Tim Xu ([xiaoxubeii](https://github.com/xiaoxubeii))
* Paco Xu ([pacoxu](https://github.com/pacoxu))
* David Porter ([bobbypage](https://github.com/bobbypage))
* Mrunal Patel ([mrunalp](https://github.com/mrunalp))

For those interested in getting involved in future discussions on the Memory QoS feature,
you can reach out to SIG Node by several means:
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
diff --git a/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/memory-qos-cal.svg b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/memory-qos-cal.svg
new file mode 100644
index 0000000000000..a85a2b1ea257b
--- /dev/null
+++ b/content/en/blog/_posts/2023-05-05-memory-qos-cgroups-v2/memory-qos-cal.svg
@@ -0,0 +1 @@
+
\ No newline at end of file