Implement pod overhead annotation #11824
base: main
Conversation
Marked the PR as WIP for now; I need to test this against the operator side of things.
Force-pushed from a11b88b to 5dfadfb
Hey @c3d, I'm curious: what settings would you use today for initial guest memory vs. pod overhead if you had insight into the whole system (i.e., at runtime you know the overhead applied to the pod cgroup)? That is, an overhead of "x" and additional memory (initial mem) to the guest of "y", where y is a percentage of x?
Force-pushed from 5dfadfb to e51958d
Hi @egernst. I am not sure I would set it as a percentage. For the orchestrator to do its job correctly with respect to scheduling pods, it needs to have an idea of how much memory is actually used. Some measurements I made in the past indicate that the upstream kernel used ~160M right after boot, the Red Hat kernel is more like 350M for older kernels, and upwards of 500M for more recent ones. This was all measured with the default initial memory of 2G. The page table and other kernel structures may depend on that initial memory size, so it's possible there would be variations if you started with a very large default memory size. Not sure if I answered the question?
Force-pushed from 4ba328f to a157951
Force-pushed from a157951 to f802d07
Add a `memory_overhead` annotation that an external operator running in an orchestrator such as Kubernetes or OpenShift can set, and take that value into account when hot-plugging memory.

The reason this mechanism is needed is that an orchestrator may set a host-side cgroup based on what it knows, and it considers the "pod overhead" as specified for example by a runtime class instead of the default memory size as given in the configuration file. The pod overhead is supposed to estimate how much additional resources are required by the runtime, and it needs to be kept relatively low in order to avoid impacting container density too negatively.

For example, consider typical values where:

- the default VM memory in `configuration.toml` is 2048M,
- the pod overhead is 350M,
- you run a container that requests 4096M.

When we create the pod VM, it boots with 2048M. We don't want the pod overhead in the orchestrator to be 2048M, though, since in most cases the VM uses much less than that. The 350M given to the orchestrator indicates that we expect the VM to consume 350M in addition to what the workloads need.

We then add the container to the pod. When the `memory_overhead` is not set, we hot-plug what the container requests, i.e. 4096M. As a result, the VM is now allocated 6144M, although it typically consumes less than that.

If the container workload exceeds the 4096M allocated to it, it will be OOM-killed by the guest cgroup, and that's exactly what we want. However, if the container workload is close to that limit, say it uses 3900M, and also uses a lot of guest kernel memory resources, e.g. by doing lots of I/Os, some of that memory (e.g. file buffers) is not counted towards the workload by the guest cgroups. Thus, we may end up with a guest that consumes for example 5500M of memory.

This is not compatible with what the orchestrator expects, which is 4096M for the workload + 350M for the overhead. If the orchestrator sets up a host-side cgroup based on what it knows, it will set the limit at 4446M, and a VM that tries to consume more than that will be OOM-killed _by the host_. Since we kill the hypervisor, we end up with messages that are much harder to understand and may not end up in the user-accessible orchestrator logs, e.g.

```
host kernel: oom_reaper: reaped process 4152164 (qemu-kvm), now anon-rss:0kB, file-rss:84kB, shmem-rss:33171748kB
```

The proposed solution is to adjust the first memory hotplug(s) to ensure that the guest memory lines up with the orchestrator's expectation. In the above example, the first hot-plug would not be 4096M, but 4096M - (2048M - 350M), making the VM size after hot-plug identical to what the orchestrator expects. If there is memory pressure, it now happens in the guest, and the OOM will now happen in the guest and not in the host.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
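To make the compensation arithmetic concrete, here is a minimal Go sketch based only on the numbers in the commit message above; the function and variable names are illustrative, not the identifiers used by the runtime:

```go
// Minimal sketch (not the runtime's actual code) of the first-hotplug
// compensation described in the commit message above.
package main

import "fmt"

// firstHotplugMB returns how much memory (in MiB) to hot-plug for the
// first container so that the resulting VM size matches what the
// orchestrator expects (workload request + declared pod overhead).
func firstHotplugMB(defaultMemMB, overheadMB, requestMB int64) int64 {
	// The VM already booted with defaultMemMB, but the orchestrator only
	// accounts for overheadMB of it; compensate for the difference.
	compensation := defaultMemMB - overheadMB
	if compensation < 0 {
		// The series later refines what happens when the declared overhead
		// exceeds the default memory; that case is not modeled here.
		compensation = 0
	}
	if requestMB < compensation {
		// The request fits within the memory that is already booted.
		return 0
	}
	return requestMB - compensation
}

func main() {
	// Values from the commit message: 2048M default, 350M overhead, 4096M request.
	// 4096 - (2048 - 350) = 2398, so the VM ends up at 2048 + 2398 = 4446M,
	// matching the orchestrator's expectation of 4096M + 350M.
	fmt.Println(firstHotplugMB(2048, 350, 4096))
}
```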
If we do not hot-plug a large enough amount of memory initially (the minimum being the difference between the default memory size and the orchestrator-known overhead), the host-side cgroup will be smaller than the default VM size. In that case, it is still possible for the host-side cgroup to kill the VM with an out-of-memory condition. Add a warning when this scenario happens, with an indication of how to fix it.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
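As a rough illustration of the warning condition only, assuming hypothetical names; the real warning text and placement live in the runtime, not here:

```go
// Rough sketch of the warning condition described above; not the PR's
// actual code or message.
package main

import "log"

// warnIfHostCgroupTooSmall warns when the workload request is below
// (default VM memory - declared overhead): the host-side cgroup
// (request + overhead) is then smaller than the VM's boot size, so the
// hypervisor can still be OOM-killed by the host.
func warnIfHostCgroupTooSmall(defaultMemMB, overheadMB, requestMB int64) {
	if requestMB < defaultMemMB-overheadMB {
		log.Printf("warning: request %dMiB + overhead %dMiB is below the default VM memory %dMiB; "+
			"the host-side cgroup may be smaller than the VM and the hypervisor may be OOM-killed by the host; "+
			"consider lowering the default memory or raising the declared overhead",
			requestMB, overheadMB, defaultMemMB)
	}
}

func main() {
	warnIfHostCgroupTooSmall(2048, 350, 1024) // 1024 + 350 < 2048, so the warning fires
}
```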
Add an annotation to pass the memory overhead to the Rust runtime. This is similar to what is done for Go in the previous commits.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Take the memory_overhead annotation into account when computing the amount of memory to hot-plug, to ensure that we do not exceed what the orchestrator expects and may have set in the host-side cgroup limit.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Add runtime tests for the memory_overhead annotation.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>

Assisted-by: Cursor with claude-4-sonnet and gpt-5 models
Add test logic to check the runtime annotations.

Tests added:

1. Annotation tests (src/runtime/pkg/oci/memory_overhead_test.go):
   * TestMemoryOverheadAnnotation(): tests valid and invalid memory overhead values
   * TestMemoryOverheadAnnotationDisabled(): tests that disabled annotations are rejected
   * TestMemoryOverheadAnnotationWithOtherMemoryAnnotations(): tests interaction with other memory annotations
   * TestMemoryOverheadAnnotationLargerThanDefaultMemory(): tests overhead values larger than the default memory

2. Memory overhead compensation tests (src/runtime/virtcontainers/memory_overhead_compensation_test.go):
   * TestMemoryOverheadCompensation(): tests the core compensation logic with various scenarios
   * TestMemoryOverheadCompensationEdgeCases(): tests edge cases like zero overhead and memory reduction
   * TestMemoryOverheadCompensationIntegration(): tests the complete flow with a mock sandbox

The tests cover:

* Valid values: 0, 10, 100, 256, 512, 1024 MiB
* Invalid values: negative numbers, non-numeric strings, decimal values, empty strings
* Annotation filtering: ensures disabled annotations are properly rejected
* Compensation logic: tests the memory overhead compensation algorithm with various scenarios
* Edge cases: zero overhead, overhead equals memory size, memory reduction scenarios

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>

Assisted-by: Cursor with claude-4-sonnet and gpt-5 models
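For illustration only, a table-driven test in the spirit of the compensation tests listed above might look like the sketch below; the function and test names are hypothetical and do not reproduce the PR's code:

```go
// Hypothetical sketch of a table-driven compensation test; not the tests
// added by this PR.
package virtcontainers_test

import "testing"

// firstHotplugMB mirrors the compensation formula from the commit message:
// hot-plug the request minus (default VM memory - declared overhead).
func firstHotplugMB(defaultMemMB, overheadMB, requestMB int64) int64 {
	compensation := defaultMemMB - overheadMB
	if compensation < 0 {
		compensation = 0
	}
	if requestMB < compensation {
		return 0
	}
	return requestMB - compensation
}

func TestMemoryOverheadCompensationSketch(t *testing.T) {
	cases := []struct {
		name                                   string
		defaultMem, overhead, request, wantHot int64
	}{
		{"typical", 2048, 350, 4096, 2398},
		{"zero overhead", 2048, 0, 4096, 2048},
		{"overhead equals default memory", 2048, 2048, 4096, 4096},
		{"request below compensation", 2048, 350, 1024, 0},
	}
	for _, c := range cases {
		if got := firstHotplugMB(c.defaultMem, c.overhead, c.request); got != c.wantHot {
			t.Errorf("%s: got %d MiB, want %d MiB", c.name, got, c.wantHot)
		}
	}
}
```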
The tests generated by Cursor AI were exploring some unexpected corner cases, such as using memory hot-plug to reduce the allocated memory. Add defensive coding against these cases and emit error messages should that happen.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
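A minimal sketch of that kind of defensive check, assuming hypothetical names rather than the PR's actual identifiers:

```go
// Illustrative defensive check, not the PR's actual code: reject attempts
// to use memory hot-plug to shrink the VM.
package main

import (
	"errors"
	"fmt"
)

func checkHotplugTarget(currentMemMB, newMemMB int64) error {
	if newMemMB < currentMemMB {
		return errors.New("memory hot-plug cannot reduce the VM memory")
	}
	return nil
}

func main() {
	if err := checkHotplugTarget(4096, 2048); err != nil {
		fmt.Println(err)
	}
}
```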
The logic was changed to make sure that we never hot-plug memory if the initial overhead is larger than the default memory. Adjust the tests accordingly.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>

Assisted-by: Cursor with claude-4-sonnet and gpt-5 models
Force-pushed from f802d07 to 4490275
As described in #6129, an orchestrator may set up a host-side cgroup with a size that is inconsistent with the maximum growth size of the VM.
This series of patches implements a memory overhead annotation that an orchestrator can use to notify the runtime of the overhead it accounts for. In that case, we can adjust the first memory hotplug so that we do not add too much memory to the VM, keeping it within the bounds of the orchestrator's expectations.
Fixes: #6129
This will also be useful in solving #6533, i.e. making sure the runtime knows about the overhead associated with the current VM image.