
Conversation

@c3d
Member

@c3d c3d commented Sep 18, 2025

As described in #6129, an orchestrator may set a host-side cgroup whose size is inconsistent with the maximum size the VM can grow to.

This series of patches implements a memory overhead annotation that an orchestrator can use to tell the runtime how much overhead it has accounted for. When the annotation is set, we adjust the first memory hotplug so that we do not add too much memory to the VM, keeping it within the bounds of the orchestrator's expectations.

Fixes: #6129

This will also be useful in solving #6533, i.e. making sure the runtime knows about the overhead associated with the current VM image.
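As a rough illustration (not code from this series), the sketch below shows how an operator-side component might set such a hint and how the runtime might read it. The exact annotation key is an assumption here; only the `memory_overhead` name comes from this series.

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Hypothetical pod annotations as an operator might set them; the key name
	// is assumed, the value is the overhead in MiB that the orchestrator
	// already accounts for (e.g. via a RuntimeClass overhead).
	annotations := map[string]string{
		"io.katacontainers.config.hypervisor.memory_overhead": "350",
	}

	// The runtime would parse the value and use it to shrink the first hot-plug.
	overheadMB, err := strconv.ParseUint(annotations["io.katacontainers.config.hypervisor.memory_overhead"], 10, 64)
	if err != nil {
		panic(err)
	}
	fmt.Printf("orchestrator-known overhead: %dM\n", overheadMB)
}
```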

@c3d c3d requested a review from a team as a code owner September 18, 2025 13:23
@c3d c3d self-assigned this Sep 18, 2025
@c3d c3d added the do-not-merge (PR has problems or depends on another) and wip (Work in Progress: PR incomplete, needs more work or rework) labels Sep 18, 2025
@c3d
Member Author

c3d commented Sep 18, 2025

Marked the PR as WIP for now; I still need to test this against the operator side of things.

@egernst
Member

egernst commented Sep 18, 2025

Hey @c3d, I’m curious: what settings would you use today for initial guest memory vs. pod overhead if you had insight into the whole system (i.e., at runtime you know the overhead applied to the pod cgroup)? I.e., an overhead of “x” and additional memory (initial mem) to the guest of “y”, where y is a % of x?

@c3d c3d force-pushed the bug/6129-pod-overhead-annotation branch from 5dfadfb to e51958d Compare September 22, 2025 13:24
@c3d
Member Author

c3d commented Sep 25, 2025

> Hey @c3d, I’m curious: what settings would you use today for initial guest memory vs. pod overhead if you had insight into the whole system (i.e., at runtime you know the overhead applied to the pod cgroup)? I.e., an overhead of “x” and additional memory (initial mem) to the guest of “y”, where y is a % of x?

Hi @egernst. I am not sure I would set it as a percentage. For the orchestrator to do its job correctly with respect to scheduling pods, it needs to have an idea of how much memory is actually used. Some measurements I made in the past indicate that the upstream kernel used ~160M right after boot, while the Red Hat kernel is more like 350M for older versions and upwards of 500M for more recent ones. This was all measured with the default initial memory of 2G. Page tables and other kernel structures may depend on that initial memory size, so there could be variations if you started with a very large default memory size.

Not sure if I answered the question?

@c3d c3d force-pushed the bug/6129-pod-overhead-annotation branch 2 times, most recently from 4ba328f to a157951 Compare September 25, 2025 10:04
@c3d c3d force-pushed the bug/6129-pod-overhead-annotation branch from a157951 to f802d07 Compare October 27, 2025 16:16
c3d added 10 commits October 30, 2025 14:12
Add a `memory_overhead` annotation that an external operator running
in an orchestrator such as Kubernetes or OpenShift can set, and
take that value into account when hot-plugging memory.

This mechanism is needed because an orchestrator may set a host-side
cgroup based on what it knows: it considers the "pod overhead", as
specified for example by a runtime class, rather than the default
memory size given in the configuration file.

The pod overhead is supposed to estimate how many additional resources
the runtime requires, and it needs to be kept relatively low to avoid
hurting container density too much.

For example, consider typical values where:
- the default VM memory in `configuration.toml` is 2048M,
- the pod overhead is 350M,
- you run a container that requests 4096M.

When we create the pod VM, it boots with 2048M. We don't want the pod
overhead in the orchestrator to be 2048M, though, since in most cases,
the VM uses much less than that. The 350M given to the orchestrator
indicates that we expect the VM to consume 350M in addition to what
the workloads need.

We then add the container to the pod. When the `memory_overhead` is
not set, we hot-plug what the container requests, i.e. 4096M. As a
result, the VM is now allocated 6144M, although it typically consumes
less than that. If the container workload exceeds the 4096M allocated
to it, it will be OOM-killed by the guest cgroup, and that's exactly
what we want.

However, if the container workload is close to that limit, say it uses
3900M, and also uses a lot of guest kernel memory, e.g. by doing a lot
of I/O, some of that memory (e.g. file buffers) is not counted towards
the workload by the guest cgroups. Thus, we may end up with a guest
that consumes, for example, 5500M of memory.

This is not compatible with what the orchestrator expects, which is
4096M for the workload + 350M for the overhead. If the orchestrator
sets up a host-side cgroup based on what it knows, it will set the
limit at 4446M, and a VM that tries to consume more than that will be
OOM-killed _by the host_. Since this kills the hypervisor itself, we
end up with messages that are much harder to understand and may not
appear in the user-accessible orchestrator logs, e.g.

```
host kernel: oom_reaper: reaped process 4152164 (qemu-kvm), now anon-rss:0kB, file-rss:84kB, shmem-rss:33171748kB
```

The proposed solution is to adjust the first memory hotplug(s) to
ensure that the guest memory lines up with the orchestrator's
expectation. In the above example, the first hot-plug would not be
4096M but 4096M - (2048M - 350M) = 2398M, making the VM size after
hot-plug (2048M + 2398M = 4446M) identical to what the orchestrator
expects (4096M + 350M).

If there is memory pressure, it now builds up in the guest, and any
OOM kill happens in the guest rather than on the host.
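
A minimal sketch of that adjustment, assuming hypothetical names and
MiB units; the actual helper in this series may be structured
differently:

```go
package main

import "fmt"

// compensateFirstHotplug returns how much memory (in MiB) to hot-plug for the
// first container request, given the VM's default (boot) memory and the
// overhead the orchestrator accounts for. Purely illustrative.
func compensateFirstHotplug(requestMB, defaultMB, overheadMB uint64) uint64 {
	if overheadMB == 0 || overheadMB >= defaultMB {
		// No annotation, or the overhead already covers the default memory:
		// hot-plug exactly what the container requests.
		return requestMB
	}
	// Memory the VM booted with but that the orchestrator did not account for.
	compensation := defaultMB - overheadMB
	if requestMB <= compensation {
		// The boot memory already covers the request: nothing to hot-plug.
		return 0
	}
	return requestMB - compensation
}

func main() {
	// Values from the example above: 2048M default, 350M overhead, 4096M request.
	fmt.Println(compensateFirstHotplug(4096, 2048, 350)) // 2398, so the VM ends up at 2048+2398 = 4446M
}
```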

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
If we do not hot-plug a large enough amount of memory initially (the
minimum being the difference between the default memory size and the
orchestrator-known overhead), the host-side cgroup will be smaller
than the default VM size. In that case, it is still possible for the
host-side cgroup to kill the VM with an out-of-memory condition.

Add a warning when this scenario happens, with an indication of how to
fix it.
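
A sketch of the condition that would trigger such a warning, again
with assumed names and MiB units; the real message wording may differ:

```go
package main

import "log"

// warnIfHostCgroupTooSmall warns when the orchestrator-side limit
// (request + overhead) is below the memory the VM already booted with,
// i.e. when the host-side cgroup could OOM-kill the VM.
func warnIfHostCgroupTooSmall(requestMB, defaultMB, overheadMB uint64) {
	if requestMB+overheadMB < defaultMB {
		log.Printf("warning: request %dM + overhead %dM is below the default VM size %dM; "+
			"the host-side cgroup may OOM-kill the VM (lower default_memory or increase the overhead)",
			requestMB, overheadMB, defaultMB)
	}
}

func main() {
	warnIfHostCgroupTooSmall(1024, 2048, 350) // 1374M limit vs. a 2048M boot memory
}
```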

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Add an annotation to pass the memory overhead to the Rust runtime.
This is similar to what is done for Go in the previous commits.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Take the memory_overhead annotation into account when computing the
amount of memory to hot-plug, to ensure that we do not exceed what the
orchestrator expects and may have set as the host-side cgroup limit.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Take the memory_overhead annotation into account when computing the
amount of memory to hot-plug, to ensure that we do not exceed what the
orchestrator expects and may have set as the host-side cgroup limit.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Take the memory_overhead annotation into account when computing the
amount of memory to hot-plug, to ensure that we do not exceed what the
orchestrator expects and may have set as the host-side cgroup limit.

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Add runtime tests for the memory_overhead annotation

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Assisted-by: Cursor with claude-4-sonnet and gpt-5 models
Add test logic to check the runtime annotations.

Tests added:
1. Annotation Tests (src/runtime/pkg/oci/memory_overhead_test.go):
* TestMemoryOverheadAnnotation()
  Tests valid and invalid memory overhead values
* TestMemoryOverheadAnnotationDisabled()
  Tests that disabled annotations are rejected
* TestMemoryOverheadAnnotationWithOtherMemoryAnnotations()
  Tests interaction with other memory annotations
* TestMemoryOverheadAnnotationLargerThanDefaultMemory()
  Tests overhead values larger than default memory

2. Memory Overhead Compensation Tests
   (src/runtime/virtcontainers/memory_overhead_compensation_test.go):
* TestMemoryOverheadCompensation()
  Tests the core compensation logic with various scenarios
* TestMemoryOverheadCompensationEdgeCases()
  Tests edge cases like zero overhead and memory reduction
* TestMemoryOverheadCompensationIntegration()
  Tests the complete flow with a mock sandbox

The tests cover:
* Valid values: 0, 10, 100, 256, 512, 1024 MiB
* Invalid values: negative numbers, non-numeric strings, decimal
  values, empty strings
* Annotation filtering: Ensures disabled annotations are properly
  rejected
* Compensation logic: Tests the memory overhead compensation algorithm
  with various scenarios
* Edge cases: Zero overhead, overhead equals memory size, memory
  reduction scenarios
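
For illustration only, a table-driven test in the spirit of the
scenarios listed above, exercising the hypothetical
compensateFirstHotplug sketch shown earlier (same package assumed); it
is not taken from the actual test files:

```go
package main

import "testing"

// TestCompensateFirstHotplugSketch mirrors the kind of scenarios the
// tests above cover, not their actual contents.
func TestCompensateFirstHotplugSketch(t *testing.T) {
	cases := []struct {
		name                          string
		request, defaultMem, overhead uint64
		want                          uint64
	}{
		{"no overhead annotation", 4096, 2048, 0, 4096},
		{"typical compensation", 4096, 2048, 350, 2398},
		{"overhead equals default memory", 4096, 2048, 2048, 4096},
		{"request fits in boot memory", 1024, 2048, 350, 0},
	}
	for _, c := range cases {
		if got := compensateFirstHotplug(c.request, c.defaultMem, c.overhead); got != c.want {
			t.Errorf("%s: got %dM, want %dM", c.name, got, c.want)
		}
	}
}
```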

Fixes: kata-containers#6129

Signed-off-by: Christophe de Dinechin <[email protected]>
Assisted-by: Cursor with claude-4-sonnet and gpt-5 models
The tests generated by Cursor AI explored some unexpected corner
cases, such as using memory hot-plugging to reduce the allocated
memory. Add defensive coding against these cases and emit error
messages should that happen.
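
A hedged sketch of the kind of defensive check meant here, with
assumed names; the actual guard and error wording in this series may
differ:

```go
package main

import "fmt"

// guardAgainstShrink rejects a hot-plug computation that would end up below
// the VM's current memory, one of the corner cases the generated tests hit.
func guardAgainstShrink(currentMB, newMB uint64) error {
	if newMB < currentMB {
		return fmt.Errorf("refusing to use memory hot-plug to shrink the VM from %dM to %dM", currentMB, newMB)
	}
	return nil
}

func main() {
	if err := guardAgainstShrink(4446, 2048); err != nil {
		fmt.Println(err)
	}
}
```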

Fixes: kata-containers#6129
Signed-off-by: Christophe de Dinechin <[email protected]>
The logic was changed to make sure that we never hot-plug if the
initial overhead is larger than the default memory.

Adjust tests accordingly.

Fixes: kata-containers#6129
Signed-off-by: Christophe de Dinechin <[email protected]>
Assisted-by: Cursor with claude-4-sonnet and gpt-5 models
@c3d c3d force-pushed the bug/6129-pod-overhead-annotation branch from f802d07 to 4490275 Compare October 30, 2025 13:13
2 participants