llama : add --hugepages for HugeTLB-backed weight loading (Linux) #21821

Closed
doctorjei wants to merge 4 commits into ggml-org:master from doctorjei:hugepages-pr

@doctorjei

doctorjei commented Apr 12, 2026

Overview

Addresses #2251 (partially: weights via the --mmap path only; I hope to address --no-mmap and the KV cache in a follow-up). I realize this is a non-trivial change (even though the implementation itself is just one file / two blocks), so I'm going into a bit more detail than I usually would with a PR. :)

Summary

Adds a new --hugepages CLI flag (env: LLAMA_ARG_HUGEPAGES) that backs model weight memory with anonymous 2 MiB HugeTLB pages on Linux.

Motivation: vmemmap reclamation
(not TLB speedup,v though it may facilitate future work in this area)

When HugeTLB Vmemmap Optimization (HVO, CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y) is enabled, the kernel frees the per-4 KiB struct page metadata within each hugepage. On a 128 GiB system this recovers ~1.75 GiB of kernel memory — enough to turn a tight-ceiling workload from OOM into working (which is what was happening to me).

The flag is opt-in at runtime, so there's no cost to anyone who doesn't use it.

# reserve the pool (runtime, no reboot for 2 MiB pages)
sudo sysctl -w vm.nr_hugepages=65536   # 65536 x 2 MiB = 128 GiB

# run with the flag
./llama-cli --hugepages -m model.gguf ...
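
To size the pool for one model rather than for all of RAM, the mapping size can be divided by the 2 MiB page size and rounded up. This is a minimal sketch; `hugepages_needed` is a hypothetical helper for illustration and is not a function from this PR.

```cpp
#include <cstdint>

// Size of one 2 MiB HugeTLB page, the page size this PR targets.
constexpr std::uint64_t HUGEPAGE_BYTES = 2ull * 1024 * 1024;

// Number of 2 MiB pages required to back `bytes` of weights, rounded up.
// Hypothetical helper; not part of the PR.
constexpr std::uint64_t hugepages_needed(std::uint64_t bytes) {
    return (bytes + HUGEPAGE_BYTES - 1) / HUGEPAGE_BYTES;
}
```

`hugepages_needed(128ull << 30)` evaluates to 65536, matching the sysctl value above. A partially used last page still consumes a whole hugepage from the pool.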

Why?

Each 4 KiB page costs ~64 bytes of struct page metadata. With 128 GiB, that's ~2 GiB just to track pages. HugeTLB with HVO reduces this: the metadata for a 2 MiB hugepage shrinks to 4 KiB instead of 32 KiB.

| System RAM | Approx. Benefit |
|------------|-----------------|
| 16 GiB     | ~224 MiB        |
| 64 GiB     | ~896 MiB        |
| 128 GiB    | ~1.75 GiB       |
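
The table values follow from the per-page figures quoted above. The sketch below reproduces the arithmetic, assuming a 64-byte struct page, 4 KiB base pages, and that HVO keeps one 4 KiB vmemmap page per 2 MiB hugepage; the names are illustrative only.

```cpp
#include <cstdint>

constexpr std::uint64_t KiB = 1024, MiB = 1024 * KiB, GiB = 1024 * MiB;

// Assumptions taken from the text: 64 B per struct page, 4 KiB base pages.
constexpr std::uint64_t STRUCT_PAGE_BYTES = 64;
constexpr std::uint64_t BASE_PAGE = 4 * KiB;
constexpr std::uint64_t HUGE_PAGE = 2 * MiB;

// Metadata per hugepage without HVO: 512 struct pages * 64 B = 32 KiB.
constexpr std::uint64_t VMEMMAP_PER_HUGEPAGE =
    (HUGE_PAGE / BASE_PAGE) * STRUCT_PAGE_BYTES;

// With HVO, 4 KiB of metadata remains per hugepage, so 28 KiB is freed.
constexpr std::uint64_t SAVED_PER_HUGEPAGE = VMEMMAP_PER_HUGEPAGE - BASE_PAGE;

// Metadata freed when `hugetlb_bytes` of RAM is placed in the 2 MiB pool.
constexpr std::uint64_t hvo_savings_bytes(std::uint64_t hugetlb_bytes) {
    return (hugetlb_bytes / HUGE_PAGE) * SAVED_PER_HUGEPAGE;
}
```

For example, `hvo_savings_bytes(128 * GiB)` is 1792 MiB (1.75 GiB). The figures assume the whole amount is backed by hugepages, which is an upper bound in practice.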

Here's my real-life example: Strix APU with 128 GiB RAM (unified in my case) running MiniMax m2.5 IQ4_XS. The total footprint is ~127,910 MiB against ~127,342 MiB available (with "normal" pages). Saving 1,792 MiB pushes available memory to ~129,134 MiB, so now the model fits with ~1 GiB to spare.

Transparent Huge Pages (MADV_HUGEPAGE) don't help here, as THP remaps existing 4 KiB pages under a 2 MiB entry for TLB efficiency, but the struct page array stays intact (and saves no memory). Only explicit HugeTLB pool allocation with HVO addresses this.

I do want to be clear that currently this only works with CPU inference; that's because I haven't tackled hipMalloc. (That would open the door to directly passing memory to the GPU without reallocating.) I also am unsure of how this would work with other unified memory systems.
As of the latest commit, this works not only with CPU inference, but also with HIP / ROCm via buffer_from_host_ptr. Vulkan support should also be possible but will require a bit more digging. NVIDIA support is a bit more difficult, as I don't have an NVIDIA integrated chipset, and there is at least one known issue with the Jetson and this approach (#15034, #15923).

Approach

In #2251, @slaren suggested adding MAP_HUGETLB to the existing mmap call. @qdacsvx noted the kernel rejects MAP_HUGETLB on regular descriptors (EINVAL). This PR implements what @slaren had in mind:

  1. llama_mmap allocates an anonymous region with MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB | MAP_POPULATE
  2. load_all_data populates the region per-tensor via file->seek + file->read_raw (same pattern as the existing --no-mmap branch)
  3. After all tensors are loaded, mprotect downgrades to PROT_READ

MAP_POPULATE forces atomic pool allocation at mmap time, so if the pool is insufficient, it yields a clean ENOMEM (with a friendly diagnostic including sysctl guidance), not a SIGBUS during load.
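
Steps 1 and 3 can be sketched as follows. This is not the code from the PR: `hugetlb_alloc` and `hugetlb_seal` are hypothetical names, the fallback macro values are the Linux UAPI definitions, and the error message is only an example of the kind of diagnostic described above. Linux only.

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Fallbacks for libc headers that lack these; values match the Linux UAPI.
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#endif

constexpr std::size_t HUGE_2MB = 2u * 1024 * 1024;

// Round a length up to a whole number of 2 MiB pages.
constexpr std::size_t round_up_2mb(std::size_t n) {
    return (n + HUGE_2MB - 1) / HUGE_2MB * HUGE_2MB;
}

// Hypothetical helper: map a writable, pre-faulted anonymous HugeTLB region.
// Returns nullptr with errno preserved if the mapping cannot be created.
void * hugetlb_alloc(std::size_t len) {
    void * p = mmap(nullptr, round_up_2mb(len), PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB | MAP_POPULATE,
                    -1, 0);
    if (p == MAP_FAILED) {
        const int err = errno;  // fprintf may overwrite errno
        std::fprintf(stderr, "hugetlb mmap failed: %s (is vm.nr_hugepages large enough?)\n",
                     std::strerror(err));
        errno = err;
        return nullptr;
    }
    return p;
}

// Step 3: once the weights have been copied in, drop write access.
bool hugetlb_seal(void * p, std::size_t len) {
    return mprotect(p, round_up_2mb(len), PROT_READ) == 0;
}
```

A caller would copy tensor bytes into the returned region, call `hugetlb_seal`, and later release it with `munmap` using the same rounded length.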

Arg-parse rejects --hugepages when combined with --no-mmap (the error wording points at a follow-up PR) or --direct-io (the loader's existing conflict resolver would otherwise silently bypass our code), and on non-Linux platforms.

Anonymous mapping is the only path to HugeTLB-backed weight memory without requiring a hugetlbfs mount, copying the model, and recompiling, a la PR #12521 #12552 (issue #12444), which isn't especially user-friendly. This PR avoids all of that — just sysctl + --hugepages.

PR #7420 also used anonymous mappings inside llama_mmap, but for direct-I/O bypass rather than hugepage backing. A concern raised there was that anonymous memory can swap under pressure. That doesn't apply here, since HugeTLB pages are inherently pinned by the kernel and cannot be swapped, reclaimed, or migrated regardless of memory pressure or RLIMIT_MEMLOCK settings. They can be relinquished / reused, though.

Tradeoffs

Warm-load time increases because --mmap normally shares page-cache pages zero-copy, while --hugepages must copy the file into the anonymous region via read_raw.

Measured on qwen3-235B-Q4 (19.4 GB), Strix Halo, Linux 6.17:

| Path            | Cold    | Warm    |
|-----------------|---------|---------|
| --mmap baseline | 3843 ms | 547 ms  |
| --hugepages     | 5030 ms | 2319 ms |
| Slowdown        | 1.31x   | 4.24x   |

The warm 4.24x is driven by Strix Halo's copy_to_user bandwidth (~8.5 GB/s on LPDDR5X). Other platforms may see less. At the 128 GiB target scale, warm-load adds ~11 seconds per session, in exchange for the model actually fitting.
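
The slowdown row and the quoted copy rate can be recomputed from the measured times. The helpers below are illustrative only, and treating the whole warm-load time as copy time is a simplifying assumption.

```cpp
// Ratio of the --hugepages load time to the --mmap baseline.
constexpr double slowdown(double hugepages_ms, double mmap_ms) {
    return hugepages_ms / mmap_ms;
}

// Effective copy rate in GB/s when `gigabytes` are loaded in `ms` milliseconds.
constexpr double copy_rate_gb_per_s(double gigabytes, double ms) {
    return gigabytes / (ms / 1000.0);
}
```

With the numbers from the table, `slowdown(5030, 3843)` is about 1.31 and `slowdown(2319, 547)` is about 4.24. `copy_rate_gb_per_s(19.4, 2319)` is roughly 8.4 GB/s, which is in line with the ~8.5 GB/s figure.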

Changes

There are three primary changes (in these three commits):

  • Adding the CLI parameter --hugepages (and LLAMA_ARG_HUGEPAGES)
  • Modifying function signatures & data structure to accommodate hugepages tracking
  • Implementation of the hugepages allocation (llama-mmap.cpp)

Testing

  • Builds clean on Linux x86_64, GCC 14.2, both CPU-only and HIP/ROCm configurations
  • End-to-end verified on Strix Halo (gfx1151, ROCm 7.2.0, Linux 6.17):
    model loads with --hugepages, inference produces correct output
    (39 t/s prompt, 8.2 t/s generation on qwen3 19.4 GB)
  • --help shows the flag with env var
  • Arg-parse rejections verified (--hugepages --no-mmap, --hugepages -dio)
  • Standalone harness validated the core mechanism on Strix Halo with a 19.4 GB GGUF:
    • Pool accounting exact (9424 x 2 MiB pages consumed/restored)
    • VmRSS confirms vmemmap reclamation: 3872 kB under --hugepages vs 19.30 GB baseline
    • No SIGBUS under MAP_POPULATE
    • read_raw EINTR/short-read handling inherited from existing code
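
For readers unfamiliar with the EINTR / short-read pattern mentioned in the last item, the sketch below shows the general technique with plain POSIX `read`. It is not the llama.cpp `read_raw` implementation; `read_full` is a hypothetical name.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Read exactly `len` bytes from `fd` into `dst`, retrying on EINTR and
// continuing after short reads. Returns false on error or premature EOF.
bool read_full(int fd, void * dst, std::size_t len) {
    assert(dst != nullptr || len == 0);
    char * out = static_cast<char *>(dst);
    while (len > 0) {
        const ssize_t n = read(fd, out, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;  // interrupted by a signal: retry the same read
            }
            return false;  // real I/O error
        }
        if (n == 0) {
            return false;  // EOF before all bytes arrived
        }
        out += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}
```

In the hugepages path, `dst` would point into the anonymous region at the tensor's offset, after seeking the file to the tensor's position.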

HVO verification

For users to confirm HVO is active:

# compiled in?
grep CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP /boot/config-$(uname -r)
# → CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y

# enabled at runtime?
cat /proc/sys/vm/hugetlb_optimize_vmemmap
# → 1  (if 0: sudo sysctl -w vm.hugetlb_optimize_vmemmap=1)

# pool state
grep -i huge /proc/meminfo
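
The pool check can also be done programmatically before attempting a load. The following is a sketch under stated assumptions: `meminfo_value` and `pool_can_fit` are hypothetical helpers, and the default hugepage size is assumed to be 2 MiB.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Return the numeric value of `key` (e.g. "HugePages_Free") from text in
// /proc/meminfo format, or -1 if the key is not present.
long long meminfo_value(std::istream & in, const std::string & key) {
    assert(!key.empty());
    std::string line;
    while (std::getline(in, line)) {
        // Lines look like "HugePages_Free:    62369" or "Hugepagesize:  2048 kB".
        if (line.size() > key.size() && line.compare(0, key.size(), key) == 0 &&
            line[key.size()] == ':') {
            std::istringstream rest(line.substr(key.size() + 1));
            long long value = -1;
            rest >> value;
            return value;
        }
    }
    return -1;
}

// True if the free pool can back `bytes` of weights (2 MiB pages assumed).
bool pool_can_fit(unsigned long long bytes) {
    std::ifstream f("/proc/meminfo");
    const long long free_pages = meminfo_value(f, "HugePages_Free");
    const unsigned long long need = (bytes + (2ull << 20) - 1) / (2ull << 20);
    return free_pages >= 0 && static_cast<unsigned long long>(free_pages) >= need;
}
```

The parser matches the full key followed by a colon, so a key such as `HugePages` does not accidentally match `HugePages_Total`.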

Follow-ups (future work / not this PR)

  • --no-mmap path + KV cache + compute buffers via a new ggml_backend_cpu_hugetlb_buffer_type (parallel to the existing HBM pattern). Reuses the --hugepages flag.
  • Multi-threaded read_raw for load parallelism
  • posix_fadvise(POSIX_FADV_DONTNEED) on the source page cache after load
  • buffer_from_host_ptr is hardcoded false for CUDA/HIP (ggml-cuda.cu:4710). This PR is forward-compatible with a future flip for zero-copy on unified-memory APUs.
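
As an illustration of the posix_fadvise follow-up, a post-load hint could look like the sketch below. This is future work and not in the PR; `drop_file_cache` is a hypothetical name, and the call is advisory, so the kernel may keep some pages cached.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: ask the kernel to drop cached pages of `path` once the
// data has been copied elsewhere. Returns 0 on success or an errno value.
int drop_file_cache(const char * path) {
    assert(path != nullptr);
    const int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return errno;
    }
    // offset 0 with len 0 covers the whole file; posix_fadvise returns the
    // error number directly instead of setting errno.
    const int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return rc;
}
```

Calling this on the GGUF path after the weights are in the hugepage region would let the kernel release the page-cache copy of the file sooner, so that it does not compete with the pool for memory.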

cc @ggerganov @slaren — per prior comments in #2251, I was hoping you could let me know what your take is on this approach. I know this issue matters for me. I've done my best to simplify / minimize code changes, but I am always happy to reconsider my approach as needed. (I hope it's OK to tag you; sorry if I missed a policy against it.)

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, AI was used in the process of preparing these commits.

AI was used to identify the appropriate strategy, draft a harness, and draft initial code snippets. Every line was reviewed, edited as appropriate, and included in commits. Sections of code were separated manually into different commits, each with a specific focus, to ensure clarity and appropriate review (e.g., CLI parameter setup, data structures / call signatures, and final implementation were each placed in separate commits during review and manual processing).

doctorjei and others added 3 commits April 12, 2026 14:02
…ing (Linux) (flag only, not implementation)

First commit to add --hugepages option (CLI). This commit only adds the structure for the flag but does not change any other code. (Commit 1/3)

Full descriptions included here for convenience...
---
Back model weight mappings with anonymous 2 MiB HugeTLB pages on
Linux, activated by a new --hugepages CLI flag (env: LLAMA_ARG_HUGEPAGES).

Primary benefit is kernel vmemmap reclamation via HugeTLB Vmemmap
Optimization (HVO) — not TLB speedup. On a 128 GiB system fully
backed with 2 MiB hugepages this frees ~1.75 GiB of struct page
memory, turning tight-ceiling workloads from OOM into working.

Mechanism: llama_mmap allocates MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB|
MAP_HUGE_2MB|MAP_POPULATE (zero-filled), then load_all_data populates
the region per-tensor via file->read_raw before check_tensors and
view allocation consume it. mprotect downgrades to PROT_READ after
load. MAP_POPULATE is a race-safety guarantee (pool-short → clean
ENOMEM at mmap time, not SIGBUS mid-load).

Measured on qwen3 19.4 GB / Strix Halo: 1.31x cold / 4.24x warm
slowdown vs baseline mmap; vmemmap reclamation confirmed via VmRSS
delta (3872 kB hugepages vs 19.30 GB baseline).

This is the second commit to add huge pages support. This commit prepares the data structures and function call signatures but does not change functionality. (Commit 2/3)
…ing (Linux) (**implementation**)

This is the final commit of 3 to implement hugepages support. This is the implementation; previous commits were preparatory. (The description is identical to that of the first commit.)
doctorjei requested review from a team, CISC and ggerganov as code owners April 12, 2026 19:10
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 12, 2026

Hi @doctorjei, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@sayap

sayap commented Apr 12, 2026

Contributor

I got this on a 128 GiB strix halo, does it look right? Thanks

$ grep -i huge /proc/meminfo
HugePages_Total:   62369
HugePages_Free:    62369
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        127731712 kB

@doctorjei

Author

> I got this on a 128 GiB strix halo, does it look right? Thanks
>
> $ grep -i huge /proc/meminfo
> HugePages_Total:   62369
> HugePages_Free:    62369
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> Hugetlb:        127731712 kB

Yep! That's what you want to see.

BTW, I just added a clarification I foolishly left out. I haven't implemented hipMalloc yet. Until that is implemented, this will be limited to CPU inference. But, I don't anticipate that it will be too crazy. In hindsight, I probably should have held the PR until then, because I don't know too many people with 128 GiB looking for CPU inference, ha.

With that said, this is the groundwork needed for hipMalloc, which I don't think will be too crazy (famous last words, right?). I should be able to just have the memory "passed" to hipMalloc without copying.

@candrews

Could the implementation be extended to also support Vulkan?

@doctorjei

doctorjei commented Apr 13, 2026

Author

> Could the implementation be extended to also support Vulkan?

Yes, the Vulkan stack solution is essentially the same (maybe easier, in fact) vs the HIP solution. This could theoretically also apply in Asahi Linux on Mac if "proper" GPU support is ever in place. This is notably not a problem in Windows or MacOS.

Edit: After some investigation, the Vulkan path is not as easy as I thought - but it's still doable. :) I'm actually wrapping up turning on HIP fully now.

Implement ggml_backend_cuda_device_buffer_from_host_ptr using
cudaHostRegister + cudaHostGetDevicePointer, enabling zero-copy
import of host-allocated memory (including hugepage-backed
regions) on unified-memory GPUs.

The capability is gated on GGML_USE_HIP && prop.integrated > 0,
read directly from cudaGetDeviceProperties. The existing force-
disable of info.devices[id].integrated (added for ggml-org#15034 on
NVIDIA Jetson) is left untouched; it affects a separate
cuda_host buffer path and does not overlap with
buffer_from_host_ptr.

Scope is limited to HIP because only Strix Halo / ROCm 7.2.0 has
been validated. NVIDIA Jetson reports prop.integrated == 1 and
may benefit, but this needs testing before extending. A TODO
comment in the code notes the extension path.

Adds an owned flag to ggml_backend_cuda_buffer_context so the
destructor can cudaHostUnregister externally-owned buffers
instead of cudaFree.
doctorjei requested review from a team and IMbackK as code owners April 19, 2026 00:08
github-actions Bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 19, 2026
@doctorjei

doctorjei commented Apr 19, 2026

Author

Info about the latest commit...

Changes

| File | Modification |
|------|--------------|
| ggml/src/ggml-cuda/ggml-cuda.cu | 2 macros |
| ggml/src/ggml-cuda/vendors/hip.h | ggml_backend_cuda_device_buffer_from_host_ptr, routing code with comments |

  • Flips the .buffer_from_host_ptr (bfhp) backend capability from the hardcoded false to runtime-gated (HIP only)
  • Wires the vtable slot from NULL to a handler (for HIP) built on hipHostRegister.
  • Reuses the previously Metal-only branch at llama-model.cpp:7802 for HIP with an integrated GPU to pass the HugeTLB area directly to the GPU (avoiding code duplication)

Potential Issues (for reviewers)

  • Dropped the cudaHostRegisterReadOnly registration flag, as it triggered a hipMemset illegal-access fault on load; the loader writes to the region during quantized-tensor padding init.
  • Added an && ctx->owned guard to init_tensor's quantized-padding memset (same cause as above). This also prevents a potential out-of-bounds memset.

Benchmark — Strix Halo, ROCm 7.2.0, Linux 6.17.x

Model: qwen3 (MOE) 30B, Q4 variant (17.28 GiB)

| Testcase            | pp512 (t/s)    | tg64 (t/s)   |
|---------------------|----------------|--------------|
| Mainline            | 565.48 ± 4.39  | 32.02 ± 0.91 |
| HIP bfhp            | 577.79 ± 52.62 | 31.21 ± 0.99 |
| HIP bfhp, hugepages | 585.44 ± 72.00 | 32.22 ± 0.71 |

Result: the pp512 and tg64 differences are statistically insignificant; there's no throughput regression for the bfhp or bfhp + hugepages variants.

@JohannesGaessler

Contributor

According to the llama.cpp AI usage policy:

> It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

@doctorjei

Author

@JohannesGaessler, I would be most grateful if you could tell me exactly what your objection is; as far as I know, I have not broken the policy. If there's something specific, I will fix it.

I spent a lot of time on this. If you don't want it, that's fine, and your prerogative. But if you really just think this is "slop" from some AI, you are mistaken. These posts were mine; the edits were mine; the responses were all mine. I am not even sure what makes you think this was AI generated. Hell, I even rebased and broke it into separate commits just to make it easier to follow.

The only thing I can think of is that I used AI to make some tables, and the pull request formatting might seem formulaic... is it one of those things? I'm grasping at straws here. I don't believe either of those violate the spirit or letter of the rules.

(I'm not trying to be flippant... but after all of my time spent, it would be nice to know exactly what the issue is so that I can correct it. My time is valuable, too. LLMs are not my full time job, but memory / kernels / etc are my bread and butter. I'm not trolling for merges here.)

@doctorjei

Author

> Hi @doctorjei, thanks for your contribution!
>
> Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
>
>   • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.
>
> Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Maybe because I didn't respond to this message directly, my PR was incorrectly flagged.

This is a mistake. I don't know how @ggml-gh-bot makes this determination, so I don't even know how to address it.

Re: the "AI generated content", I listed all of the details. I don't believe I violated the policy. But, it's all there in the clear! :) (I'll admit that I use a lot of dashes and stuff, which I hear is very AI like... ha.)

@JohannesGaessler

Contributor

I read the OP and got the impression it was written by a language model, which is grounds for immediate closure.

@doctorjei

doctorjei commented Apr 20, 2026

Author

> I read the OP and got the impression it was written by a language model, which is grounds for immediate closure.

Ok. I appreciate the response. I can understand that impression. And I apologize for what I now see is an overly aggressive response. I did use tables (as I mentioned) and asked for a review of the text via LLM. I'll clear out the description and rewrite it - without an editing pass from an LLM - or if you prefer, submit a new PR.

Edit: oh, here's some sweet irony... I have typos in my OP even though I got editing help :D (I only say this as some evidence of the "human" nature of it (e.g., line under "motivation" has a random "v", and there are some other winners.. I see I duped some stuff in the "motivation" and "why" sections, too... anyway, I will continue rewriting now.)

@JohannesGaessler

Contributor

To clarify why we have such a hard-line stance: we get a lot of garbage PRs where the human pressing the submit button has no actual understanding of the code. We explicitly disallow using language models to write PR descriptions because that is the primary way to mask a lack of understanding. Rewriting a PR description after the fact is usually not a reason for reconsideration. I will come back to this PR after #20834 has been resolved.

@doctorjei

Author

I understand why you do what you do, and I appreciate the effort. I'll be around when / if you want to consider it. Until then, I'll have access to it myself.

I know this is a big project and you have a lot of people and PRs. I'm just trying to do my part. Good luck.

@doctorjei

doctorjei commented Apr 25, 2026

Author

From a review in another repo, I've made some tweaks:

  1. include/llama.h - moved use_hugepages to end of struct.
  2. src/llama-model-loader.cpp - switched to mapping->mmap_size() (from mapping->size())
  3. src/llama-mmap.cpp - added fallback MAP_HUGE_SHIFT (in case undefined in some glibc variants)

Other notes:

  1. The commit isn't showing here, I suppose because the PR is "closed" still; just wanted to note the additional update.
  2. There's an NVIDIA tag, but (I believe) this commit doesn't actually touch NVIDIA devices at all; at least for the moment, this is CPU and HIP only, because I haven't been able to test on NVIDIA yet (though I will put in the leg work to do so if this is considered for merging)
  3. Here's the fork this was recently merged into, if you want to see that review:
    Adds hugepages support to reduce metadata memory consumption domvox/llama.cpp-turboquant-hip#3

Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (issues specific to Nvidia GPUs)

4 participants