[Bugfix] Stream Llama4 weight loading to avoid host-OOM with copy-returning loaders by noa-neria · Pull Request #44645 · vllm-project/vllm

noa-neria · 2026-06-05T10:30:38Z

[Bugfix] Stream Llama4 weight loading to avoid host-OOM with copy-returning loaders

Purpose

Loading Llama-4-Scout/Maverick (Llama4ForConditionalGeneration) with --load-format runai_streamer OOMs the host during weight loading: each TP worker transiently holds ~the entire language-model checkpoint in host RAM, so on a multi-GPU node the workers collectively exceed available memory and get OOM-killed (issue #44430).

Root cause: the Llama4 weight-loading path fully materializes the weights iterator before loading anything into the model, in two places:

Llama4ForConditionalGeneration.load_weights (mllama4.py) drains the iterator into intermediate lists (_separate_and_rename_weights / _handle_expert_scale_broadcasting).
Llama4ForCausalLM.load_weights (llama4.py) — the dominant one — applies permute_qk_weight_for_rotary over the iterator with a list comprehension, holding the whole language model as (name, tensor) tuples in a list before handing it to AutoWeightsLoader.

This is harmless for the default loader because the tensors are zero-copy mmap views (file-backed page cache). But loaders that return private copies of each tensor end up with the full checkpoint resident as anonymous memory, multiplied across TP workers → host OOM.

The fix also prevents device OOM for weight loaders that return device tensors.

This PR makes the Llama4 load path stream (lazy) instead of materializing:

llama4.py Llama4ForCausalLM.load_weights: change the permute_qk_weight_for_rotary list comprehension to a generator expression so AutoWeightsLoader consumes weights lazily (primary fix).
mllama4.py Llama4ForConditionalGeneration.load_weights: rewrite to stream the dominant language-model weights straight into AutoWeightsLoader, buffering only the small vision/projector and scalar expert-scale groups. Removes the now-unused _separate_and_rename_weights and _handle_expert_scale_broadcasting helpers.

Behavior is unchanged — the same weights are loaded, in the same order, into the same parameters; only when each weight is materialized changes (lazy vs. eager). Per-worker host residency drops from ~the full checkpoint to roughly the loader's own buffer plus one in-flight tensor. This benefits any copy-returning loader.

Test Plan

On a single 8×H200 node, TP=8, --enforce-eager, model Llama-4-Scout-17B-16E-Instruct (bf16, ~203 GiB), compare the stock loader vs. this change, measuring peak anonymous cgroup memory during load:

vllm serve $MODEL --tensor-parallel-size 8 --max-model-len 8192 --enforce-eager --load-format runai_streamer — stock (this fix reverted) and fixed.
vllm serve $MODEL ... --load-format auto — reference / correctness ground truth.
Sample anon from /sys/fs/cgroup/memory.stat once per second to capture the peak.
Correctness: temp-0 chat completion with prompt "What is the capital of France? Answer in one short sentence.", compare the fixed-loader output to the auto-loader output.

Test Result

Variant	Loader	Result
baseline (fix reverted)	`runai_streamer`	host peak_anon ≈ 1019 GiB → OOMKilled mid-load
fixed	`runai_streamer`	bounded (~44 GiB/worker), no OOM, loads and serves
reference	`auto`	loads and serves, peak_anon ≈ 39 GiB

Correctness (temp 0):

auto: "The capital of France is Paris."
runai_streamer (fixed): "The capital of France is Paris." — byte-identical.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

mergify · 2026-06-05T10:44:11Z

Hi @noa-neria, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

noa-neria · 2026-06-08T14:50:59Z

@DarkLight1337 may you please help reviewing?
This PR is similar to #42244 for the Llama4 weight-loading path, which now fully drains the weights iterator before loading anything into the model.
This loading pattern can only be combined with the default loader which returns mmap view but not with other loaders which yield private copy of each tensor.

DarkLight1337

Thanks

mergify · 2026-06-08T15:02:58Z

Hi @noa-neria, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

mergify · 2026-06-08T16:22:49Z

Hi @noa-neria, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

mergify · 2026-06-10T19:20:02Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @noa-neria.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Noa Neria <nneria@nvidia.com>

mergify · 2026-06-11T12:29:59Z

Documentation preview: https://vllm--44645.org.readthedocs.build/en/44645/

mergify · 2026-06-11T12:30:31Z

Hi @noa-neria, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

claude Bot reviewed Jun 5, 2026

View reviewed changes

mergify Bot added llama Related to Llama models bug Something isn't working labels Jun 5, 2026

DarkLight1337 approved these changes Jun 8, 2026

View reviewed changes

DarkLight1337 enabled auto-merge (squash) June 8, 2026 14:57

github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 8, 2026

mergify Bot added the needs-rebase label Jun 10, 2026

noa-neria added 2 commits June 11, 2026 15:09

load llama4 model without draining the weights iterator

12107c9

Signed-off-by: Noa Neria <nneria@nvidia.com>

pre commit

87e121b

Signed-off-by: Noa Neria <nneria@nvidia.com>

auto-merge was automatically disabled June 11, 2026 12:29
Head branch was pushed to by a user without write access

noa-neria force-pushed the runai-streamer branch from fe6813e to 87e121b Compare June 11, 2026 12:29

mergify Bot added the documentation Improvements or additions to documentation label Jun 11, 2026

DarkLight1337 enabled auto-merge (squash) June 11, 2026 12:30

mergify Bot removed the needs-rebase label Jun 11, 2026

Merge branch 'main' into runai-streamer

646c5cb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix] Stream Llama4 weight loading to avoid host-OOM with copy-returning loaders#44645

[Bugfix] Stream Llama4 weight loading to avoid host-OOM with copy-returning loaders#44645
noa-neria wants to merge 3 commits into
vllm-project:mainfrom
noa-neria:runai-streamer

noa-neria commented Jun 5, 2026

Uh oh!

claude Bot left a comment

Uh oh!

mergify Bot commented Jun 5, 2026

Uh oh!

noa-neria commented Jun 8, 2026

Uh oh!

DarkLight1337 left a comment

Uh oh!

mergify Bot commented Jun 8, 2026

Uh oh!

mergify Bot commented Jun 8, 2026

Uh oh!

mergify Bot commented Jun 10, 2026

Uh oh!

mergify Bot commented Jun 11, 2026

Uh oh!

mergify Bot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

noa-neria commented Jun 5, 2026

[Bugfix] Stream Llama4 weight loading to avoid host-OOM with copy-returning loaders

Purpose

Test Plan

Test Result

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

mergify Bot commented Jun 5, 2026

Uh oh!

noa-neria commented Jun 8, 2026

Uh oh!

DarkLight1337 left a comment

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented Jun 8, 2026

Uh oh!

mergify Bot commented Jun 8, 2026

Uh oh!

mergify Bot commented Jun 10, 2026

Uh oh!

mergify Bot commented Jun 11, 2026

Uh oh!

mergify Bot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants