[Bugfix] Stream Llama4 weight loading to avoid host-OOM with copy-returning loaders #44645

Open
noa-neria wants to merge 3 commits into vllm-project:main from noa-neria:runai-streamer

@noa-neria

Contributor

Purpose

Loading Llama-4-Scout/Maverick (Llama4ForConditionalGeneration) with --load-format runai_streamer exhausts host memory during weight loading: each TP worker transiently holds roughly the entire language-model checkpoint in host RAM, so on a multi-GPU node the workers collectively exceed available memory and are OOM-killed (issue #44430).

Root cause: the Llama4 weight-loading path fully materializes the weights iterator before loading anything into the model, in two places:

  1. Llama4ForConditionalGeneration.load_weights (mllama4.py) drains the iterator into intermediate lists (_separate_and_rename_weights / _handle_expert_scale_broadcasting).
  2. Llama4ForCausalLM.load_weights (llama4.py), the dominant one, applies permute_qk_weight_for_rotary over the iterator with a list comprehension, holding the whole language model as (name, tensor) tuples in a list before handing it to AutoWeightsLoader.

This is harmless for the default loader because the tensors are zero-copy mmap views (file-backed page cache). But loaders that return private copies of each tensor end up with the full checkpoint resident as anonymous memory, multiplied across TP workers → host OOM.
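
The difference between the two consumption styles can be illustrated with a toy model. This is only a sketch under stated assumptions: `FakeTensor`, `copy_returning_loader`, `transform`, and `consume` are hypothetical stand-ins (not vLLM or Run:ai APIs), and the live-object count relies on CPython's reference counting to release objects promptly.

```python
from typing import Callable, Iterable, Iterator, Tuple


class FakeTensor:
    """Hypothetical stand-in for a privately copied tensor; counts live instances."""

    live = 0
    peak = 0

    def __init__(self) -> None:
        FakeTensor.live += 1
        FakeTensor.peak = max(FakeTensor.peak, FakeTensor.live)

    def __del__(self) -> None:
        # Runs as soon as the last reference is dropped (CPython refcounting).
        FakeTensor.live -= 1


Weight = Tuple[str, FakeTensor]


def copy_returning_loader(n: int) -> Iterator[Weight]:
    """Hypothetical loader that yields a fresh private copy per weight."""
    for i in range(n):
        yield f"layers.{i}.weight", FakeTensor()


def transform(name: str, tensor: FakeTensor) -> Weight:
    """Placeholder for a per-weight transform such as a Q/K permutation."""
    return name, tensor


def consume(weights: Iterable[Weight]) -> None:
    """Placeholder for the model loader: looks at one weight at a time."""
    for _name, _tensor in weights:
        pass


def peak_live(make_weights: Callable[[], Iterable[Weight]]) -> int:
    """Reset the counters, run the load, and report the peak live-tensor count."""
    FakeTensor.live = 0
    FakeTensor.peak = 0
    consume(make_weights())
    return FakeTensor.peak


N = 8
# List comprehension: every tensor exists before the consumer sees the first one.
eager_peak = peak_live(lambda: [transform(n, t) for n, t in copy_returning_loader(N)])
# Generator expression: tensors are produced only as the consumer asks for them.
lazy_peak = peak_live(lambda: (transform(n, t) for n, t in copy_returning_loader(N)))
print(eager_peak, lazy_peak)
```

In this toy, the eager peak equals `N`, while the lazy peak stays small and does not grow with `N`.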

The fix also prevents device OOM for weight loaders that return device tensors.

This PR makes the Llama4 load path stream weights lazily instead of materializing them up front:

  • llama4.py Llama4ForCausalLM.load_weights: change the permute_qk_weight_for_rotary list comprehension to a generator expression so AutoWeightsLoader consumes weights lazily (primary fix).
  • mllama4.py Llama4ForConditionalGeneration.load_weights: rewrite to stream the dominant language-model weights straight into AutoWeightsLoader, buffering only the small vision/projector and scalar expert-scale groups. Removes the now-unused _separate_and_rename_weights and _handle_expert_scale_broadcasting helpers.
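
The "stream the large group, buffer the small group" idea from the second bullet could look roughly like the sketch below. It is not the PR's code: the function name, the `is_small` predicate, and the `vision_model.` / `language_model.` prefixes are assumptions chosen for illustration, and plain integers stand in for tensors.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Weight = Tuple[str, object]


def stream_large_buffer_small(
    weights: Iterable[Weight],
    is_small: Callable[[str], bool],
    small_buffer: List[Weight],
) -> Iterator[Weight]:
    """Yield large weights immediately; park small ones in small_buffer."""
    for name, tensor in weights:
        if is_small(name):
            # Small groups are cheap to hold until the stream is finished.
            small_buffer.append((name, tensor))
        else:
            # Large weights pass straight through, one at a time.
            yield name, tensor


# Hypothetical checkpoint entries; ints stand in for tensors.
checkpoint = [
    ("language_model.layers.0.weight", 0),
    ("vision_model.patch_embed.weight", 1),
    ("language_model.layers.1.weight", 2),
]
buffered: List[Weight] = []
streamed = list(
    stream_large_buffer_small(
        iter(checkpoint),
        lambda name: name.startswith("vision_model."),
        buffered,
    )
)
```

Note that `buffered` is complete only after the generator has been fully drained, so a caller following this pattern would load the streamed weights first and the buffered ones afterwards.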

Behavior is unchanged: the same weights are loaded, in the same order, into the same parameters. The only difference is when each weight is materialized (lazily rather than eagerly). Per-worker host residency drops from roughly the full checkpoint to roughly the loader's own buffer plus one in-flight tensor. This benefits any copy-returning loader.

Test Plan

On a single 8×H200 node, TP=8, --enforce-eager, model Llama-4-Scout-17B-16E-Instruct (bf16, ~203 GiB), compare the stock loader vs. this change, measuring peak anonymous cgroup memory during load:

  • vllm serve $MODEL --tensor-parallel-size 8 --max-model-len 8192 --enforce-eager --load-format runai_streamer — stock (this fix reverted) and fixed.
  • vllm serve $MODEL ... --load-format auto — reference / correctness ground truth.
  • Sample anon from /sys/fs/cgroup/memory.stat once per second to capture the peak.
  • Correctness: temp-0 chat completion with prompt "What is the capital of France? Answer in one short sentence.", compare the fixed-loader output to the auto-loader output.
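
The once-per-second sampling step could be scripted along the following lines. This is a sketch, not the script used for the numbers in this PR; the helper names are hypothetical, and it assumes a cgroup v2 `memory.stat` file whose lines have the form `key value`, with `anon` reported in bytes.

```python
import time
from pathlib import Path


def parse_anon_bytes(memory_stat_text: str) -> int:
    """Return the value of the `anon` line from cgroup v2 memory.stat text."""
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "anon":
            return int(value)
    raise ValueError("no 'anon' entry found")


def sample_peak_anon(path: str, duration_s: int, interval_s: float = 1.0) -> int:
    """Poll memory.stat and return the largest `anon` value seen, in bytes."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, parse_anon_bytes(Path(path).read_text()))
        time.sleep(interval_s)
    return peak
```

For example, `sample_peak_anon("/sys/fs/cgroup/memory.stat", duration_s=600)` would poll for ten minutes while the server loads; dividing the result by `2**30` gives GiB.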

Test Result

| Variant | Loader | Result |
|---|---|---|
| baseline (fix reverted) | runai_streamer | host peak_anon ≈ 1019 GiB → OOMKilled mid-load |
| fixed | runai_streamer | bounded (~44 GiB/worker), no OOM, loads and serves |
| reference | auto | loads and serves, peak_anon ≈ 39 GiB |

Correctness (temp 0):

  • auto: "The capital of France is Paris."
  • runai_streamer (fixed): "The capital of France is Paris." Byte-identical.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the llama and bug labels Jun 5, 2026
@mergify

mergify Bot commented Jun 5, 2026

Contributor

Hi @noa-neria, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@noa-neria

Contributor Author

@DarkLight1337 could you please help review this?
This PR is similar to #42244, but for the Llama4 weight-loading path, which currently drains the weights iterator fully before loading anything into the model.
That loading pattern works only with the default loader, which returns mmap views; it does not work with loaders that yield a private copy of each tensor.

@DarkLight1337 DarkLight1337 left a comment

Member

Thanks

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 8, 2026 14:57
@github-actions github-actions Bot added the ready label Jun 8, 2026
@mergify

mergify Bot commented Jun 8, 2026

Contributor

Hi @noa-neria, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@mergify

mergify Bot commented Jun 10, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @noa-neria.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 10, 2026
Signed-off-by: Noa Neria <nneria@nvidia.com>
Signed-off-by: Noa Neria <nneria@nvidia.com>
auto-merge was automatically disabled June 11, 2026 12:29

Head branch was pushed to by a user without write access

@mergify

mergify Bot commented Jun 11, 2026

Contributor

Documentation preview: https://vllm--44645.org.readthedocs.build/en/44645/

@mergify mergify Bot added the documentation label Jun 11, 2026
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 11, 2026 12:30
@mergify

mergify Bot commented Jun 11, 2026

Contributor

Hi @noa-neria, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@mergify mergify Bot removed the needs-rebase label Jun 11, 2026
Labels

bug, documentation, llama, ready

2 participants