
Conversation

@c3d
Member

@c3d c3d commented Sep 18, 2025

As described in #6129, an orchestrator may set a host-side cgroup whose size is inconsistent with the maximum size the VM can grow to.

This series of patches implements a memory overhead annotation that an orchestrator can use to tell the runtime how much overhead it has accounted for. When the annotation is set, we adjust the first memory hotplug so that we do not add too much memory to the VM, keeping it within the bounds of the orchestrator's expectations.

Fixes: #6129

This will also be useful in solving #6533, i.e. making sure the runtime knows about the overhead associated with the current VM image.
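As a rough illustration (not code from this series), the sketch below shows how an operator-side component might set such a hint and how the runtime might read it. The exact annotation key is an assumption here; only the `memory_overhead` name comes from this series.

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Hypothetical pod annotations as an operator might set them; the key name
	// is assumed, the value is the overhead in MiB that the orchestrator
	// already accounts for (e.g. via a RuntimeClass overhead).
	annotations := map[string]string{
		"io.katacontainers.config.hypervisor.memory_overhead": "350",
	}

	// The runtime would parse the value and use it to shrink the first hot-plug.
	overheadMB, err := strconv.ParseUint(annotations["io.katacontainers.config.hypervisor.memory_overhead"], 10, 64)
	if err != nil {
		panic(err)
	}
	fmt.Printf("orchestrator-known overhead: %dM\n", overheadMB)
}
```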

@c3d c3d requested a review from a team as a code owner September 18, 2025 13:23
@c3d c3d self-assigned this Sep 18, 2025
@c3d c3d added the do-not-merge (PR has problems or depends on another) and wip (Work in Progress: PR incomplete, needs more work or rework) labels Sep 18, 2025
@c3d
Member Author

c3d commented Sep 18, 2025

Marked the PR as WIP for now; I still need to test this against the operator side of things.

@egernst
Member

egernst commented Sep 18, 2025

Hey @c3d, I’m curious: what settings would you use today for initial guest memory vs. pod overhead if you had insight into the whole system (i.e., at runtime you know the overhead applied to the pod cgroup)? I.e., an overhead of “x” and additional memory (initial mem) to the guest of “y”, where y is a % of x?

@c3d c3d force-pushed the bug/6129-pod-overhead-annotation branch from 5dfadfb to e51958d Compare September 22, 2025 13:24
@c3d
Member Author

c3d commented Sep 25, 2025

> Hey @c3d, I’m curious: what settings would you use today for initial guest memory vs. pod overhead if you had insight into the whole system (i.e., at runtime you know the overhead applied to the pod cgroup)? I.e., an overhead of “x” and additional memory (initial mem) to the guest of “y”, where y is a % of x?

Hi @egernst. I am not sure I would set it as a percentage. For the orchestrator to do its job correctly with respect to scheduling pods, it needs to have an idea of how much memory is actually used. Some measurements I made in the past indicate that the upstream kernel used ~160M right after boot, while the Red Hat kernel is more like 350M for older versions and upwards of 500M for more recent ones. This was all measured with the default initial memory of 2G. Page tables and other kernel structures may depend on that initial memory size, so there could be variations if you started with a very large default memory size.

Not sure if I answered the question?

@c3d c3d force-pushed the bug/6129-pod-overhead-annotation branch 2 times, most recently from 4ba328f to a157951 Compare September 25, 2025 10:04
@c3d c3d force-pushed the bug/6129-pod-overhead-annotation branch from a157951 to f802d07 Compare October 27, 2025 16:16
c3d added 10 commits October 30, 2025 14:12
Add a `memory_overhead` annotation that an external operator running
in an orchestrator such as Kubernetes or OpenShift can set, and
take that value into account when hot-plugging memory.

This mechanism is needed because an orchestrator may set a host-side
cgroup based on what it knows: it considers the "pod overhead", as
specified for example by a runtime class, rather than the default
memory size given in the configuration file.

The pod overhead is supposed to estimate how many additional resources
the runtime requires, and it needs to be kept relatively low to avoid
hurting container density too much.

For example, consider typical values where:
- the default VM memory in `configuration.toml` is 2048M,
- the pod overhead is 350M,
- you run a container that requests 4096M.

When we create the pod VM, it boots with 2048M. We don't want the pod
overhead in the orchestrator to be 2048M, though, since in most cases,
the VM uses much less than that. The 350M given to the orchestrator
indicates that we expect the VM to consume 350M in addition to what
the workloads need.

We then add the container to the pod. When the `memory_overhead` is
not set, we hot-plug what the container requests, i.e. 4096M. As a
result, the VM is now allocated 6144M, although it typically consumes
less than that. If the container workload exceeds the 4096M allocated
to it, it will be OOM-killed by the guest cgroup, and that's exactly
what we want.

However, if the container workload is close to that limit, say it uses
3900M, and also uses a lot of guest kernel memory, e.g. by doing a lot
of I/O, some of that memory (e.g. file buffers) is not counted towards
the workload by the guest cgroups. Thus, we may end up with a guest
that consumes, for example, 5500M of memory.

This is not compatible with what the orchestrator expects, which is
4096M for the workload + 350M for the overhead. If the orchestrator
sets up a host-side cgroup based on what it knows, it will set the
limit at 4446M, and a VM that tries to consume more than that will be
OOM-killed _by the host_. Since this kills the hypervisor itself, we
end up with messages that are much harder to understand and may not
appear in the user-accessible orchestrator logs, e.g.

```
host kernel: oom_reaper: reaped process 4152164 (qemu-kvm), now anon-rss:0kB, file-rss:84kB, shmem-rss:33171748kB
```

The proposed solution is to adjust the first memory hotplug(s) to
ensure that the guest memory lines up with the orchestrator's
expectation. In the above example, the first hot-plug would not be
4096M but 4096M - (2048M - 350M) = 2398M, making the VM size after
hot-plug (2048M + 2398M = 4446M) identical to what the orchestrator
expects (4096M + 350M).

If there is memory pressure, it now builds up in the guest, and any
OOM kill happens in the guest rather than on the host.
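
A minimal sketch of that adjustment, assuming hypothetical names and
MiB units; the actual helper in this series may be structured
differently:

```go
package main

import "fmt"

// compensateFirstHotplug returns how much memory (in MiB) to hot-plug for the
// first container request, given the VM's default (boot) memory and the
// overhead the orchestrator accounts for. Purely illustrative.
func compensateFirstHotplug(requestMB, defaultMB, overheadMB uint64) uint64 {
	if overheadMB == 0 || overheadMB >= defaultMB {
		// No annotation, or the overhead already covers the default memory:
		// hot-plug exactly what the container requests.
		return requestMB
	}
	// Memory the VM booted with but that the orchestrator did not account for.
	compensation := defaultMB - overheadMB
	if requestMB <= compensation {
		// The boot memory already covers the request: nothing to hot-plug.
		return 0
	}
	return requestMB - compensation
}

func main() {
	// Values from the example above: 2048M default, 350M overhead, 4096M request.
	fmt.Println(compensateFirstHotplug(4096, 2048, 350)) // 2398, so the VM ends up at 2048+2398 = 4446M
}
```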

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
If we do not hot-plug a large enough amount of memory initially (the
minimum being the difference between the default memory size and the
orchestrator-known overhead), the host-side cgroup will be smaller
than the default VM size. In that case, it is still possible for the
host-side cgroup to kill the VM with an out-of-memory condition.

Add a warning when this scenario happens, with an indication of how to
fix it.
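
A sketch of the condition that would trigger such a warning, again
with assumed names and MiB units; the real message wording may differ:

```go
package main

import "log"

// warnIfHostCgroupTooSmall warns when the orchestrator-side limit
// (request + overhead) is below the memory the VM already booted with,
// i.e. when the host-side cgroup could OOM-kill the VM.
func warnIfHostCgroupTooSmall(requestMB, defaultMB, overheadMB uint64) {
	if requestMB+overheadMB < defaultMB {
		log.Printf("warning: request %dM + overhead %dM is below the default VM size %dM; "+
			"the host-side cgroup may OOM-kill the VM (lower default_memory or increase the overhead)",
			requestMB, overheadMB, defaultMB)
	}
}

func main() {
	warnIfHostCgroupTooSmall(1024, 2048, 350) // 1374M limit vs. a 2048M boot memory
}
```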

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Add an annotation to pass the memory overhead to the Rust runtime.
This is similar to what is done for Go in the previous commits.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Take the memory_overhead annotation into account when computing the
amount of memory to hot-plug, to ensure that we do not exceed what the
orchestrator expects and may have set as the host-side cgroup limit.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Take the memory_overhead annotation into account when computing the
amount of memory to hot-plug, to ensure that we do not exceed what the
orchestrator expects and may have set as the host-side cgroup limit.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Take the memory_overhead annotation into account when computing the
amount of memory to hot-plug, to ensure that we do not exceed what the
orchestrator expects and may have set as the host-side cgroup limit.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Add runtime tests for the memory_overhead annotation

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Assisted-by: Cursor with claude-4-sonnet and gpt-5 models
Add test logic to check the runtime annotations.

Tests added:
1. Annotation Tests (src/runtime/pkg/oci/memory_overhead_test.go):
* TestMemoryOverheadAnnotation()
  Tests valid and invalid memory overhead values
* TestMemoryOverheadAnnotationDisabled()
  Tests that disabled annotations are rejected
* TestMemoryOverheadAnnotationWithOtherMemoryAnnotations()
  Tests interaction with other memory annotations
* TestMemoryOverheadAnnotationLargerThanDefaultMemory()
  Tests overhead values larger than default memory

2. Memory Overhead Compensation Tests
   (src/runtime/virtcontainers/memory_overhead_compensation_test.go):
* TestMemoryOverheadCompensation()
  Tests the core compensation logic with various scenarios
* TestMemoryOverheadCompensationEdgeCases()
  Tests edge cases like zero overhead and memory reduction
* TestMemoryOverheadCompensationIntegration()
  Tests the complete flow with a mock sandbox

The tests cover:
* Valid values: 0, 10, 100, 256, 512, 1024 MiB
* Invalid values: negative numbers, non-numeric strings, decimal
  values, empty strings
* Annotation filtering: Ensures disabled annotations are properly
  rejected
* Compensation logic: Tests the memory overhead compensation algorithm
  with various scenarios
* Edge cases: Zero overhead, overhead equals memory size, memory
  reduction scenarios
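
For illustration only, a table-driven test in the spirit of the
scenarios listed above, exercising the hypothetical
compensateFirstHotplug sketch shown earlier (same package assumed); it
is not taken from the actual test files:

```go
package main

import "testing"

// TestCompensateFirstHotplugSketch mirrors the kind of scenarios the
// tests above cover, not their actual contents.
func TestCompensateFirstHotplugSketch(t *testing.T) {
	cases := []struct {
		name                          string
		request, defaultMem, overhead uint64
		want                          uint64
	}{
		{"no overhead annotation", 4096, 2048, 0, 4096},
		{"typical compensation", 4096, 2048, 350, 2398},
		{"overhead equals default memory", 4096, 2048, 2048, 4096},
		{"request fits in boot memory", 1024, 2048, 350, 0},
	}
	for _, c := range cases {
		if got := compensateFirstHotplug(c.request, c.defaultMem, c.overhead); got != c.want {
			t.Errorf("%s: got %dM, want %dM", c.name, got, c.want)
		}
	}
}
```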

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Assisted-by: Cursor with claude-4-sonnet and gpt-5 models
The tests generated by Cursor AI explored some unexpected corner
cases, such as using memory hot-plugging to reduce the allocated
memory. Add defensive coding against these cases and emit error
messages should that happen.
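
A hedged sketch of the kind of defensive check meant here, with
assumed names; the actual guard and error wording in this series may
differ:

```go
package main

import "fmt"

// guardAgainstShrink rejects a hot-plug computation that would end up below
// the VM's current memory, one of the corner cases the generated tests hit.
func guardAgainstShrink(currentMB, newMB uint64) error {
	if newMB < currentMB {
		return fmt.Errorf("refusing to use memory hot-plug to shrink the VM from %dM to %dM", currentMB, newMB)
	}
	return nil
}

func main() {
	if err := guardAgainstShrink(4446, 2048); err != nil {
		fmt.Println(err)
	}
}
```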

Fixes: kata-containers#6129
Signed-off-by: Christophe de Dinechin <[email protected]>
The logic was changed to make sure that we never hot-plug if the
initial overhead is larger than the default memory.

Adjust tests accordingly.

Fixes: kata-containers#6129
Signed-off-by: Christophe de Dinechin <[email protected]>
Assisted-by: Cursor with claude-4-sonnet and gpt-5 models
@c3d c3d force-pushed the bug/6129-pod-overhead-annotation branch from f802d07 to 4490275 Compare October 30, 2025 13:13
2 participants