KEP-5759: Memory Manager Hugepages Availability Verification #5753
srikalyan wants to merge 8 commits into kubernetes:master
Conversation
This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported free hugepages availability during pod admission.

Problem: The Memory Manager only tracks hugepage allocations for Guaranteed QoS pods. Burstable/BestEffort pods can consume hugepages without being tracked, causing subsequent Guaranteed pods to be admitted but fail at runtime when hugepages are exhausted.

Solution:
- Add `FreePages` field to cadvisor's `HugePagesInfo` (PR google/cadvisor#3804)
- Verify OS-reported free hugepages during `Allocate()` in the Static policy
- Reject pods when insufficient free hugepages are available

Related: kubernetes/kubernetes#134395
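As a quick sketch of what the fresh sysfs read looks like: Linux exposes a per-NUMA-node counter at `/sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages`. The helper names below are illustrative, not the actual kubelet or cadvisor API:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// freeHugepagesPath builds the sysfs path for a NUMA node's free-hugepages
// counter, e.g. /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages.
func freeHugepagesPath(numaNode, pageSizeKB int) string {
	return fmt.Sprintf("/sys/devices/system/node/node%d/hugepages/hugepages-%dkB/free_hugepages",
		numaNode, pageSizeKB)
}

// parseFreePages parses the single decimal counter stored in the sysfs file.
func parseFreePages(contents string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(contents), 10, 64)
}

// readFreePages reads the live counter. Unlike machine info cached at kubelet
// startup, this reflects hugepages consumed by any process since boot.
func readFreePages(numaNode, pageSizeKB int) (uint64, error) {
	data, err := os.ReadFile(freeHugepagesPath(numaNode, pageSizeKB))
	if err != nil {
		return 0, err
	}
	return parseFreePages(string(data))
}

func main() {
	if n, err := parseFreePages("8192\n"); err == nil {
		fmt.Println("free 2MiB pages:", n)
	}
}
```

This is one file read per hugepage size per NUMA node, which is the basis for the "< 1ms typically" latency claim in the risks table.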
/remove-area kubelet |
|
|
/cc |
|
/ok-to-test |
ffromani
left a comment
Thanks for your contribution! I'm in favor of improving the accounting and making the memory manager/kubelet more predictable. I think we can benefit from some clarifications before we dive into further details.
| 1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or | ||
| `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager |
this makes me think we need a better accounting/validation mechanism in general, not just for memory manager. Because the very issue we are attacking here is also relevant to burstable pods, and to some extent to best effort pods.
Agreed. This KEP focuses on the Memory Manager Static policy as a targeted fix, but the underlying issue of consistent hugepage accounting across QoS classes is worth discussing as a broader improvement. Perhaps a follow-up KEP for unified hugepage tracking?
| This creates a problem when: | ||
| 1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or | ||
| `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager | ||
| 2. External processes or other system components consume hugepages |
One of the key assumptions of how the kubelet operates in general is that it is the sole owner of a node. There is some leeway in some cases: we can pre-partition CPUs and make the kubelet assume it is the sole owner of the resource pool it got when started, and we probably should do the same for hugepages. But in general, dynamic co-sharing of resources (the kubelet racing with other daemons or programs) is not supported and it's unlikely it ever will be.
Understood. To clarify: the issue here isn't external daemons racing with the kubelet. It's that both pods are managed by the kubelet and properly request hugepages. The gap is internal: the scheduler tracks at node level; the Memory Manager tracks at per-NUMA level but only for Guaranteed pods. The Burstable pod's hugepages are tracked by the scheduler but not by the Memory Manager's Static policy.
| 1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or | ||
| `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager | ||
| 2. External processes or other system components consume hugepages | ||
| 3. The Memory Manager's internal state becomes stale or inconsistent with reality |
how can this happen? do we have examples or scenarios?
Yes! See kubernetes/kubernetes#134395 for the real-world scenario:
- m6id.32xlarge with 2 NUMA nodes, 16GB of 2MB hugepages per node
- Burstable pod requests ~12GB of 2MB hugepages → scheduled and runs
- Guaranteed pod requests ~12GB of 2MB hugepages → admitted to NUMA node 1
- Memory Manager thought node1 had ~15.2GB free
- OS actually reported node1 had only ~3.2GB free
The gap: Memory Manager only tracks Guaranteed pods for NUMA placement.
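To make the divergence concrete (the ~0.8 GiB of tracked Guaranteed usage below is inferred from the reported numbers and purely illustrative), the two views differ by exactly the untracked Burstable consumption:

```go
package main

import "fmt"

// Illustrative accounting for the issue #134395 scenario, in units of 0.1 GiB
// so the arithmetic stays exact. The Memory Manager subtracts only tracked
// Guaranteed-pod allocations from the total, while the OS free_hugepages
// counter also reflects the untracked Burstable consumption.
func memoryManagerFree(total, guaranteedTracked int) int {
	return total - guaranteedTracked
}

func osReportedFree(total, guaranteedTracked, untracked int) int {
	return total - guaranteedTracked - untracked
}

func main() {
	total := 160     // 16 GiB of 2MiB hugepages on NUMA node 1
	tracked := 8     // ~0.8 GiB of tracked Guaranteed usage (illustrative)
	burstable := 120 // ~12 GiB consumed by the Burstable pod, untracked

	fmt.Printf("Memory Manager view: %.1f GiB free\n", float64(memoryManagerFree(total, tracked))/10)
	fmt.Printf("OS view: %.1f GiB free\n", float64(osReportedFree(total, tracked, burstable))/10)
}
```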
| ### Goals | ||
|
|
||
| - Verify OS-reported free hugepages during pod admission for the Static policy | ||
| - Reject pods requesting hugepages when insufficient free hugepages are available |
we need to be mindful this will cause another opportunity for rejection loops like kubernetes/kubernetes#84869
Good point. I'll review that issue. The key difference here is that the rejection would be based on actual OS state (sysfs free_hugepages), not internal tracking discrepancy. This should make the rejection more accurate and actionable - the message would indicate "insufficient free hugepages on NUMA node X" rather than a vague resource conflict.
this is true, but it's still very likely that controllers will just create runaway pods, because this is another kubelet-local rejection the scheduler doesn't predict or expect, caused by an information imbalance between the scheduler and the kubelet (cc @wojtek-t EDIT SORRY I meant @dom4ha - we talked about this in the context of the kubelet-driven pod reschedules)
we can't solve these cases without scheduler actions, and this would bring the total to at least 3 causes of runaway pod creation:
- TopologyAffinityError
- SMTAlignmentError
- now we are adding another error (HugePagesAlignmentError?).
Each of them makes sense from the node perspective - we can't admit a workload whose resource request can't be satisfied - but it creates bad UX and a storm of failed pods.
cc @44past4
Agreed -- this is a valid concern. Admission rejection at the node level is the right thing to do (we can't admit what we can't provide), but without scheduler awareness of these node-level rejections, the same pod can be repeatedly scheduled to the same node, creating a storm of failures.
This is a broader kubelet/scheduler coordination problem that affects TopologyAffinityError and SMTAlignmentError equally. Solving it properly likely requires scheduler-side changes (e.g., back-off or taint-based feedback from repeated admission failures), which is beyond the scope of this KEP.
For this KEP, the admission error will follow the same pattern as existing node-level rejections, so it doesn't make the overall situation worse -- it just prevents a silent runtime failure from becoming the failure mode instead.
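As a sketch of the check under discussion (function and type names are illustrative, not the actual Static policy code), the verification during `Allocate()` boils down to comparing the request against the OS counter for the candidate NUMA nodes:

```go
package main

import "fmt"

// freePagesReader abstracts the sysfs free_hugepages read so the check can be
// unit-tested; the real implementation would read the per-NUMA counter.
// This is a hypothetical interface, not the actual kubelet API.
type freePagesReader func(numaNode int, pageSizeBytes uint64) (uint64, error)

// verifyFreeHugepages rejects admission when the OS reports fewer free pages
// across the candidate NUMA nodes than the pod requests for a given page size.
func verifyFreeHugepages(read freePagesReader, numaNodes []int, pageSizeBytes, requestedBytes uint64) error {
	needed := (requestedBytes + pageSizeBytes - 1) / pageSizeBytes // round up to whole pages
	var free uint64
	for _, n := range numaNodes {
		f, err := read(n, pageSizeBytes)
		if err != nil {
			return fmt.Errorf("reading free hugepages for NUMA node %d: %w", n, err)
		}
		free += f
	}
	if free < needed {
		return fmt.Errorf("insufficient free %d-byte hugepages on NUMA nodes %v: need %d, OS reports %d free",
			pageSizeBytes, numaNodes, needed, free)
	}
	return nil
}

func main() {}
```

With the numbers from the motivating issue (~3.2 GiB actually free on the target NUMA node), a ~12 GiB request would be rejected here at admission instead of failing at runtime.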
Force-pushed 5f71eb8 to fed79ac
Key changes:
- Update milestones to v1.36/v1.37/v1.38
- Clarify sysfs reading: add GetCurrentHugepagesInfo() for fresh reads (GetMachineInfo() is cached at startup, would be stale)
- Add Integration with Topology Manager section with policy behavior table
- Add Interaction with CPU Manager section
- Address reserved hugepages (free_hugepages is the correct metric)
- Expand race condition discussion with failure handling details
- Rewrite Story 2 as "Rapid Pod Churn" with clear timeline
- Add "Static policy only" note (None policy not applicable)
- Specify error message format with example
- Add kubelet restart behavior note
- Update Risks table with new mitigations
- Fix unit test description (removed nil reference)
- Update TOC with new sections
- Link enhancement issue kubernetes#5759

Related: kubernetes#5759
Force-pushed fed79ac to 9a89040
|
/retitle KEP-5759: Memory Manager Hugepages Availability Verification |
- Add two implementation approaches: Option A (direct sysfs) and Option B (cadvisor)
- Present pros/cons for each option neutrally for KEP review
- Remove cadvisor-specific sections, replace with options discussion
- Add Observability section with metrics, events, logs, alerting
- Update TOC to pass CI verification
- Update KEP number to 5759 throughout

The choice between implementation approaches is left to KEP reviewers based on maintainability preferences and timeline considerations.
Force-pushed c40cb0b to 8e6ae09
|
Thanks @srikalyan for leading this effort. I'm in general supportive of this memory manager enhancement and, pending further review and elaboration, I do see the benefit of the proposed approach of checking free hugepages. Because there's some time left before the 1.36 cycle begins, I'd like to explore other options to solve this problem before we commit to the proposed direction. I'll have another review iteration ASAP. |
|
@ffromani Happy new year to you. Could I request another review? |
ffromani
left a comment
thanks for the updates. The next step is to bring this up on the larger sig-node and in the 1.36 SIG planning. I think this work would be well accepted by the SIG, but let's make sure.
| @@ -0,0 +1,697 @@ | |||
| # KEP-5759: Memory Manager Hugepages Availability Verification | |||
once we agree at sig-node level about this work, we need to add the prod-readiness tracking file.
I think this work deserves to be talked about on a sig-node meeting for coordination.
| From [issue #134395](https://github.com/kubernetes/kubernetes/issues/134395), | ||
| on an m6id.32xlarge instance with 2 NUMA nodes: |
This alone justifies the fix. The memory manager and the kubelet admission process failed to honor the pod contract. The pod was admitted, letting the workload believe that resources were available and allocatable, when they were not.
So this part is fine. What I'm circling around is the implications. I'm still thinking about whether we need to fix the admission phase in general, and whether we should mitigate (or fix) the scheduler problem (https://github.com/kubernetes/enhancements/pull/5753/files#r2649077040)
There are pretty convincing arguments about the benefits of the logic being described here. But should we extend the memory manager, or should we add a NUMA-aware admission check for any kind of pod requiring hugepages?
- the failure model is the same (kubelet admission error)
- we should be able to easily get a pod NUMA affinity from another admission handler (without strong coupling between components)
- the logic to check the per-NUMA current availability is the same
- we have an obvious case to also check all pods regardless of QoS (not just Guaranteed)
I think it's worth elaborating pros and cons of both approaches
NOTE: this comes pretty late from me so I won't object if we just extend the memory manager and/or defer this to another alpha
Added Alternative 3. Please resolve if you are good with it.
| - Track hugepage usage by Burstable or BestEffort pods in the Memory Manager | ||
| - Modify scheduler behavior or add hugepage awareness to the scheduler | ||
| - Provide hugepage reservation or preemption mechanisms | ||
| - Support platforms other than Linux |
thanks. It's possible sig-windows and sig-node are fine with a linux-first (or linux-only) solution, but it's good to ask the question nevertheless
| **Desired behavior**: The Guaranteed pod admission fails immediately with a clear | ||
| error indicating insufficient free hugepages, allowing the scheduler to try | ||
| another node or the administrator to take corrective action. |
the alternative I'm thinking about is to extend the memory manager and admission logic to listen to each and every pod admission to track where (= which NUMA node) the hugepages are allocated from. However, if the kubelet doesn't enforce a cpuset.mems restriction, there's no way to know where the hugepage is going to be taken from until the container processes are running, i.e. past the admission stage. Therefore, the proposed approach of checking the actual free resources before each and every allocation attempt seems to be the best compromise (if not the only possible approach) in the current architecture.
We should probably document this in the "discarded alternatives" section.
|
|
||
| # The milestone at which this feature was, or is targeted to be, at each stage. | ||
| milestone: | ||
| alpha: "v1.36" |
pending SIG discussion and approval, there's a good chance this work can start as beta per recent KEP graduation guidelines. The change is quite self contained and targeted, so it qualifies.
We got informal agreement in sig-node to start in beta.
CORRECTION: my fault, wrong recollection! But my own point still remains!
so you decided to start alpha?
Yes, targeting alpha for v1.36. The milestones have been relaxed to beta v1.38 and stable v1.40 per your feedback on the timeline being too aggressive.
How do you recommend I approach this? |
- Add ffromani, derekwaynecarr, mrunalp as reviewers
- Add dchen1107 as approver (sig-node OWNERS)
Force-pushed 2be55a9 to 36099e3
Hi @srikalyan, SIG Node meets weekly on Tuesdays at 10:00 PT (Pacific Time), so you can attend this week's meeting to discuss more with the SIG Node tech leads and chairs. The Zoom link and details can be viewed here: https://github.com/kubernetes/community/tree/master/sig-node. This KEP already has the /lead-opted-in and /milestone v1.36 labels from SIG Node, so I think the first deadline we will target is Production Readiness Freeze - 4th February 2026 (AoE) / Thursday 5th February 2026, 12:00 UTC. |
|
Thank you Wendy, will join this Tuesday. |
|
- Add haircommander (Peter Hunt) as KEP approver
- Add PRR approval file for alpha stage with johnbelamaric as approver
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: srikalyan. |
|
FWIW: I think the need is clear and the code is pretty narrowly scoped. I am +1 on this, but we may not have TL bandwidth to get it done now |
johnbelamaric
left a comment
One small comment, otherwise PRR looks good.
@ffromani I guess this is something that will not be an issue if we use DRA for huge pages, assuming we, say, create each NUMA node as a device with consumable capacity? We still need to fix this, of course.
Yes, I think this is correct. We'd need to agree on the right attributes to expose, which will be an interesting discussion on its own, but the DRA model should prevent this issue completely. Elaborating a bit on the DRA side (unrelated to this PR): the initial proposal is to use the NUMA node ID as a proxy to identify the memory controller and to be able to bind it to a group of CPUs. |
|
FYI, for PRR just awaiting SIG approval, I have one nit above but I consider it non-blocking. kep.yaml update does need to happen too though. |
|
Thank you all. Will address the feedback soon. |
|
Co-authored-by: Wendy Ha <139814343+wendy-ha18@users.noreply.github.com>
Thank you everyone for the feedback; I have addressed all of it. Let me know if you have any questions. |
|
| | Risk | Mitigation | | ||
| |------|------------| | ||
| | sysfs reads add latency to admission | Minimal impact: single file read per hugepage size per NUMA node; < 1ms typically | | ||
| | False rejections due to transient consumption | Acceptable: better to reject than admit and fail at runtime; pod can be rescheduled | |
we ping-ponged between the two. A cloud-native system should react to workload runtime failures. But the core argument remains: the system is failing to actuate what it promised, and this needs fixing.
I mean over time, in the SIG and as general approach, not in this KEP.
Agreed. The updated text in this section now frames the core issue as a kubelet/workload contract breach rather than just a timing concern, which captures this point.
| **Downgrade**: Disabling the feature gate returns to previous behavior where | ||
| OS hugepage availability is not verified. No data migration needed. |
| Yes. Disabling the feature gate and restarting kubelet returns to previous | ||
| behavior. No persistent state is affected. |
Not sure we need a long-term toggle for this feature. I tend to believe it should be always-enabled.
| - [ ] Events | ||
| - Event Reason: `FailedHugepagesVerification` | ||
| - When: Pod admission rejected due to insufficient OS-reported free hugepages |
probably other metrics here?
The current set covers the core signals: verification rate (_total), failure rate (_failures_total), and latency (_latency_seconds). Open to suggestions on what additional metrics would be useful here. One candidate could be memory_manager_hugepages_os_available_bytes (gauge per NUMA node per hugepage size) to give operators visibility into actual hugepage availability, but I'd prefer to evaluate this during alpha based on real operational experience rather than speculate now.
| Additional metrics that could be added in Beta: | ||
| - `memory_manager_hugepages_discrepancy_bytes`: Gauge showing difference between | ||
| Memory Manager's internal tracking and OS-reported free hugepages (useful for | ||
| detecting drift) |
but how's gonna be actionable from user/cluster-admin perspective?
Good point. The discrepancy metric on its own isn't directly actionable -- knowing there's drift doesn't tell the operator what to do about it. I've updated the KEP to remove this speculative metric and instead evaluate what's actually useful during alpha based on operational experience.
| - Needs to independently resolve NUMA topology and candidate node selection, which | ||
| the Memory Manager already computes during `Allocate()` | ||
| - Additional admission handler adds coordination overhead with existing handlers | ||
| - For Guaranteed pods, the Memory Manager's allocation algorithm already selects | ||
| candidate NUMA nodes -- a standalone handler would duplicate or need to replicate | ||
| this selection logic to know which NUMA nodes to check | ||
| - Larger implementation scope for alpha |
can't we just use topology manager getaffinity? I'm unconvinced still this requires so much duplication
That's a fair point -- GetAffinity() could provide the NUMA hints without duplicating the selection logic. The standalone handler would still need to coordinate timing (it needs to run after topology hints are computed but before resources are committed), but the NUMA duplication concern in the cons is overstated.
That said, for alpha we're going with extending the Memory Manager where the candidate NUMA nodes are already available at the point of verification. If a standalone handler proves to be the better long-term approach (especially for covering non-Guaranteed pods), using GetAffinity() would be the right path. I've noted this alternative in the KEP for future consideration.
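For comparison, a hypothetical standalone handler could look roughly like this, with the pod's NUMA assignment coming from a GetAffinity-style lookup rather than from the Memory Manager's own node selection. Every name here is an assumption for illustration, not the actual kubelet interfaces:

```go
package main

import "fmt"

// affinityLookup stands in for a Topology Manager GetAffinity-style query
// returning the NUMA node IDs assigned to a container (hypothetical).
type affinityLookup func(podUID, containerName string) []int

// freeReader stands in for the per-NUMA sysfs free_hugepages read (hypothetical).
type freeReader func(numaNode int) (uint64, error)

// admitHugepages checks the OS-reported free pages on the container's assigned
// NUMA nodes without duplicating the Memory Manager's selection logic.
func admitHugepages(affinity affinityLookup, free freeReader, podUID, container string, neededPages uint64) error {
	nodes := affinity(podUID, container)
	var available uint64
	for _, n := range nodes {
		f, err := free(n)
		if err != nil {
			return err
		}
		available += f
	}
	if available < neededPages {
		return fmt.Errorf("insufficient free hugepages on NUMA nodes %v: need %d, have %d",
			nodes, neededPages, available)
	}
	return nil
}

func main() {}
```

The design trade-off is that this handler stays decoupled from the Memory Manager but must run after topology hints are computed and before resources are committed, as noted above.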
- Move metrics from Beta to Alpha graduation criteria per ffromani's request to have observability available at alpha stage
- Change "TBD during alpha phase" to "Will be done during alpha phase" per johnbelamaric's nit on the upgrade/rollback testing question
- Add Alternative 3: Standalone NUMA-aware hugepages admission handler with pros/cons analysis per ffromani's suggestion
- Expand Alternative 1 with NUMA tracking limitation: without cpuset.mems enforcement, NUMA node allocation is unknown until container runtime, making per-pod tracking infeasible at admission
- Reframe race condition caveat to emphasize kubelet/workload contract breach rather than just startup failure timing
- Relax milestone timeline: beta v1.38, stable v1.40
- Remove sysfs availability from risk table (sysfs is a kubelet precondition)
- Recommend Option A (direct sysfs reading) with rationale
- Remove feature gate as safety mechanism framing throughout
- Remove hardcoded error message format (not a public API)
- Remove specific log format and alerting recommendation sections
- Simplify Events section to describe behavior without locking format
- Move conformance tests from GA to Beta criteria
- Update GA to "feature always enabled (feature gate removed)"
- Reword Upgrade/Downgrade without feature gate dependency
- Update rollback answer to reflect always-enabled at GA
- Replace speculative discrepancy metric with alpha evaluation plan
Force-pushed d0147fb to f80d3ac
Summary
This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported free hugepages availability during pod admission.
Problem
The Memory Manager only tracks hugepage allocations for Guaranteed QoS pods. Burstable/BestEffort pods can consume hugepages (via hugetlbfs mounts or `mmap` with `MAP_HUGETLB`) without being tracked, causing subsequent Guaranteed pods to be admitted but fail at runtime when hugepages are exhausted.
Solution
- Add `FreePages` field to `HugePagesInfo` with a new `GetCurrentHugepagesInfo()` method for fresh sysfs reads (PR: Add FreePages to HugePagesInfo for hugepage availability reporting, google/cadvisor#3804)
- Verify OS-reported free hugepages during `Allocate()` in the Static policy
Related
KEP Metadata
Feature gate: MemoryManagerHugepagesVerification
/sig node
/kind kep