[Bugfix] Stream Llama4 weight loading to avoid host-OOM with copy-returning loaders#44645
noa-neria wants to merge 3 commits into
Hi @noa-neria, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, ...
@DarkLight1337 could you please help review this?
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Noa Neria <nneria@nvidia.com>
Head branch was pushed to by a user without write access
fe6813e to 87e121b
Documentation preview: https://vllm--44645.org.readthedocs.build/en/44645/
Purpose
Loading Llama-4-Scout/Maverick (`Llama4ForConditionalGeneration`) with `--load-format runai_streamer` OOMs the host during weight loading: each TP worker transiently holds roughly the entire language-model checkpoint in host RAM, so on a multi-GPU node the workers collectively exceed available memory and get OOM-killed (issue #44430).

Root cause: the Llama4 weight-loading path fully materializes the weights iterator before loading anything into the model, in two places:

- `Llama4ForConditionalGeneration.load_weights` (`mllama4.py`) drains the iterator into intermediate lists (`_separate_and_rename_weights` / `_handle_expert_scale_broadcasting`).
- `Llama4ForCausalLM.load_weights` (`llama4.py`), the dominant one, applies `permute_qk_weight_for_rotary` over the iterator with a list comprehension, holding the whole language model as `(name, tensor)` tuples in a list before handing it to `AutoWeightsLoader`.

This is harmless for the default loader because the tensors are zero-copy mmap views (file-backed page cache). But loaders that return private copies of each tensor end up with the full checkpoint resident as anonymous memory, multiplied across TP workers, which leads to host OOM.
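The difference between the eager and lazy patterns can be illustrated with a small, self-contained simulation. This is only a sketch, not the vLLM code: `fake_loader`, `consume`, `Tracker`, and `peak_live` are hypothetical names, and a simple counter stands in for host memory residency of copy-returning tensors.

```python
from typing import Iterable, Iterator, Tuple

Weight = Tuple[str, bytearray]


class Tracker:
    """Counts how many weight buffers are alive at once (a stand-in for RAM)."""

    def __init__(self) -> None:
        self.live = 0
        self.peak = 0


def fake_loader(n: int, size: int, tracker: Tracker) -> Iterator[Weight]:
    """Hypothetical copy-returning loader: each yield allocates a private buffer."""
    for i in range(n):
        tracker.live += 1
        tracker.peak = max(tracker.peak, tracker.live)
        yield f"layer.{i}.weight", bytearray(size)


def consume(weights: Iterable[Weight], tracker: Tracker) -> int:
    """Stand-in for loading into the model: each buffer is released after use."""
    count = 0
    for _name, _buf in weights:
        tracker.live -= 1
        count += 1
    return count


def peak_live(eager: bool, n: int = 8, size: int = 1024) -> int:
    tracker = Tracker()
    source = fake_loader(n, size, tracker)
    if eager:
        # List comprehension: drains the whole iterator before consume() starts.
        weights = [(name, buf) for name, buf in source]
    else:
        # Generator expression: one item is produced per step of consume().
        weights = ((name, buf) for name, buf in source)
    consume(weights, tracker)
    return tracker.peak
```

In this simulation the eager path peaks at all `n` buffers alive at once, while the lazy path never holds more than one.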
The fix also prevents device OOM for weight loaders that return device tensors.

This PR makes the Llama4 load path stream (lazy) instead of materializing:

- `llama4.py`, `Llama4ForCausalLM.load_weights`: change the `permute_qk_weight_for_rotary` list comprehension to a generator expression so `AutoWeightsLoader` consumes weights lazily (primary fix).
- `mllama4.py`, `Llama4ForConditionalGeneration.load_weights`: rewrite to stream the dominant language-model weights straight into `AutoWeightsLoader`, buffering only the small vision/projector and scalar expert-scale groups. Removes the now-unused `_separate_and_rename_weights` and `_handle_expert_scale_broadcasting` helpers.

Behavior is unchanged: the same weights are loaded, in the same order, into the same parameters; only the point at which each weight is materialized changes (lazy vs. eager). Per-worker host residency drops from roughly the full checkpoint to roughly the loader's own buffer plus one in-flight tensor. This benefits any copy-returning loader.
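The "stream the large group, buffer only the small ones" idea can be sketched as a generator that passes language-model weights through and sets everything else aside. This is an illustrative sketch, not the code in this PR: `stream_language_weights` and the `language_model.` prefix are assumptions, as are the weight names in the usage example, and the real implementation may handle ordering and renaming differently.

```python
from typing import Iterable, Iterator, List, Tuple

Weight = Tuple[str, object]

# Hypothetical prefix used only for this illustration.
LANGUAGE_MODEL_PREFIX = "language_model."


def stream_language_weights(
    weights: Iterable[Weight], side_buffer: List[Weight]
) -> Iterator[Weight]:
    """Yield language-model weights lazily; collect all other weights.

    Only the (assumed small) non-language-model entries are held in
    side_buffer, so the large group is never materialized as a list.
    """
    for name, tensor in weights:
        if name.startswith(LANGUAGE_MODEL_PREFIX):
            yield name, tensor
        else:
            side_buffer.append((name, tensor))


# Usage: the consumer pulls one language-model weight at a time.
side: List[Weight] = []
source = iter([
    ("vision_model.a", 1),
    ("language_model.b", 2),
    ("multi_modal_projector.c", 3),
    ("language_model.d", 4),
])
streamed_names = [name for name, _ in stream_language_weights(source, side)]
side_names = [name for name, _ in side]
```

The side buffer is only complete once the generator has been exhausted, so any processing of the buffered groups has to happen after the streamed weights have been consumed.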
Test Plan
On a single 8×H200 node, TP=8, `--enforce-eager`, model `Llama-4-Scout-17B-16E-Instruct` (bf16, ~203 GiB), compare the stock loader vs. this change, measuring peak anonymous cgroup memory during load:

- `vllm serve $MODEL --tensor-parallel-size 8 --max-model-len 8192 --enforce-eager --load-format runai_streamer`: stock (this fix reverted) and fixed.
- `vllm serve $MODEL ... --load-format auto`: reference / correctness ground truth.
- Read `anon` from `/sys/fs/cgroup/memory.stat` once per second to capture the peak.
- Compare the temperature-0 output against the `auto`-loader output.

Test Result
[Results table not recoverable; surviving row labels: `runai_streamer`, `runai_streamer`, `auto`]

Correctness (temp 0):

- `auto`: "The capital of France is Paris."
- `runai_streamer` (fixed): "The capital of France is Paris." (byte-identical)
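The peak-memory measurement described in the Test Plan could be scripted roughly as follows. This is a minimal sketch rather than the script used for this PR: `parse_anon_bytes` and `poll_peak_anon` are hypothetical helpers, the sample count is arbitrary, and it assumes a cgroup v2 `memory.stat` file whose lines have the form `key value`.

```python
import time
from pathlib import Path


def parse_anon_bytes(memory_stat_text: str) -> int:
    """Return the value of the 'anon' field from memory.stat contents."""
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "anon":  # exact match, so 'anon_thp' is not picked up
            return int(value)
    raise ValueError("no 'anon' field found")


def poll_peak_anon(
    path: str = "/sys/fs/cgroup/memory.stat",
    samples: int = 60,
    interval_s: float = 1.0,
) -> int:
    """Sample the file repeatedly and return the largest 'anon' value seen."""
    peak = 0
    for _ in range(samples):
        peak = max(peak, parse_anon_bytes(Path(path).read_text()))
        time.sleep(interval_s)
    return peak
```

Sampling once per second can miss short spikes between samples, so the reported peak is a lower bound on the true peak.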