Avoid silent weights corruption when loading Nemotron Nano VL with reusable-buffer loaders like runai distributed streaming #42244

Merged: DarkLight1337 merged 4 commits into vllm-project:main from noa-neria:nemotron-loader on May 11, 2026

@noa-neria

Contributor

Fixing bug #41749

Purpose

NemotronH_Nano_VL_V2.load_weights partitioned all checkpoint tensors into three lists (llm_weights, vision_weights, sound_weights) before loading any of them into model parameters. This is safe with the default loader. Loaders that reuse an internal buffer between iterations (e.g. runai_streamer in distributed mode) yield source tensors that share an underlying buffer, and that buffer is overwritten as iteration advances. Holding references to those tensors across the full partition pass therefore corrupts weights silently: later items overwrite the buffer backing earlier ones that are still sitting in the lists.
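The failure mode can be reproduced without any model code. The sketch below is a stand-in, not the runai_streamer implementation: `reusable_buffer_loader` is a hypothetical loader that yields `memoryview` slices over a single scratch `bytearray`, which is assumed here to behave like tensors backed by a reused streamer buffer.

```python
from typing import Iterator, List, Tuple


def reusable_buffer_loader(
    checkpoint: List[Tuple[str, bytes]],
) -> Iterator[Tuple[str, memoryview]]:
    """Hypothetical loader: every yielded view aliases ONE shared scratch buffer."""
    size = max(len(data) for _, data in checkpoint)
    scratch = bytearray(size)
    for name, data in checkpoint:
        # Same-length slice assignment: overwrites what earlier views point at.
        scratch[: len(data)] = data
        yield name, memoryview(scratch)[: len(data)]


checkpoint = [("a.weight", b"\x01\x01"), ("b.weight", b"\x02\x02")]

# Buggy pattern: hold references across the whole pass, read them afterwards.
held = [(name, view) for name, view in reusable_buffer_loader(checkpoint)]
stale = {name: bytes(view) for name, view in held}

# Safe pattern: copy each item before the iterator advances.
copied = {name: bytes(view) for name, view in reusable_buffer_loader(checkpoint)}

print(stale["a.weight"] == checkpoint[0][1])   # False: overwritten by b.weight
print(copied["a.weight"] == checkpoint[0][1])  # True
```

No exception is raised in the buggy pattern, which is why the corruption is silent: the held views are still valid objects, they just show the most recently streamed bytes.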

Change

vllm/model_executor/models/nano_nemotron_vl.py:

  • The LLM weights (~97% of the model) are now streamed through an inner generator consumed by self.language_model.load_weights(...). Each tensor is copied into its parameter before the iterator advances, so no stale reference is retained.
  • The smaller multimodal (mm) components (mlp1, vision_model, sound_encoder) are copied with detach().clone() into per-component buffer lists during the same single pass, then loaded after the LLM completes. Cloning makes them independent of any reusable streamer buffer.
  • Existing load_multimodal_weights gating (skip mm components when image/video/audio prompt limits are all 0) is preserved.
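The three points above can be sketched as a toy single-pass loader. Everything here is illustrative: `shared_buffer_source` and `load_all` are hypothetical names, a one-byte `bytearray` stands in for a tensor, and `bytes(w)` plays the role of `w.detach().clone()`. It is not the code in nano_nemotron_vl.py.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Weight = Tuple[str, bytearray]


def shared_buffer_source(items: List[Tuple[str, int]]) -> Iterator[Weight]:
    """Hypothetical streamer: yields the SAME bytearray object every time."""
    scratch = bytearray(1)
    for name, value in items:
        scratch[0] = value
        yield name, scratch


def load_all(weights: Iterable[Weight], load_multimodal: bool = True):
    llm_params: Dict[str, bytes] = {}
    mm_buffered: List[Tuple[str, bytes]] = []

    def llm_gen() -> Iterator[Weight]:
        for name, w in weights:
            if name.startswith("language_model."):
                # Streamed lazily; the consumer copies before we advance.
                yield name.split(".", 1)[1], w
            elif load_multimodal:
                # Copy now, so the entry no longer aliases the scratch buffer.
                mm_buffered.append((name, bytes(w)))

    # Stand-in for self.language_model.load_weights(...): copy each tensor
    # into its "parameter" immediately.
    for name, w in llm_gen():
        llm_params[name] = bytes(w)

    # Multimodal components are "loaded" after the LLM pass completes.
    return llm_params, dict(mm_buffered)


items = [("vision_model.w", 7), ("language_model.layers.0.w", 1), ("mlp1.w", 9)]
llm, mm = load_all(shared_buffer_source(items))
llm_only, mm_skipped = load_all(shared_buffer_source(items), load_multimodal=False)
```

With gating disabled, the multimodal branches are skipped during the same pass, so nothing is cloned for components that will never be loaded.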

No public API change.

@noa-neria noa-neria requested a review from tomeras91 as a code owner May 10, 2026 17:34

@mergify mergify Bot added the multi-modality label (Related to multi-modality, #4194) on May 10, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request refactors the weight loading logic for the NanoNemotronVL model to use a generator for streaming LLM weights lazily, which prevents stale-reference corruption. Multimodal weights are now detached and cloned to maintain independence from reusable buffers, and tests have been updated with a mock tensor class to support these changes. Review feedback points out a fragile dependency where multimodal weights are only fully collected if the language model loader completely consumes the generator, suggesting that the generator should be explicitly exhausted to ensure all weights are loaded.

Comment on lines +1531 to +1551
def llm_weights_gen():
    for name, w in weights:
        if is_llm(name):
            # Strip 'language_model.' prefix for LLM weights
            yield ".".join(name.split(".")[1:]), w
        elif is_adapter_weights((name, w)):
            if not load_multimodal_weights:
                continue
            trimmed_name = ".".join(name.split(".")[1:])
            adapter_weights.append((trimmed_name, w.detach().clone()))
        elif is_vision_weights(name):
            if not load_multimodal_weights:
                continue
            # Convert: vision_model.radio_model.* → radio_model.*
            hf_key = name[len("vision_model.") :]
            vision_weights.append((hf_key, w.detach().clone()))
        elif is_sound_weights(name):
            if not load_multimodal_weights:
                continue
            assert self.sound_encoder is not None
            sound_weights.append((name, w.detach().clone()))

Contributor

high

The implementation of llm_weights_gen relies on the assumption that self.language_model.load_weights will fully consume the generator. If for any reason the language model's weight loader stops early (e.g., it only looks for a subset of weights), the multimodal weights (adapter_weights, vision_weights, sound_weights) will be only partially collected, leading to incomplete weight loading for those components. While the current AutoWeightsLoader in vLLM does consume the full iterable, this creates a fragile temporal coupling between the LLM loading phase and the multimodal collection phase. A more robust approach would be to ensure the generator is fully exhausted before proceeding to load multimodal components, or to explicitly document this dependency.
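The coupling described here is easy to see in isolation. In this minimal sketch, `routed` is a hypothetical generator with the same shape as llm_weights_gen (yield LLM names, append everything else to a side list), and a single `next()` call stands in for an LLM loader that stops iterating early.

```python
from typing import Iterator, List


def routed(names: List[str], side: List[str]) -> Iterator[str]:
    """Hypothetical router: side-list collection only happens while iterating."""
    for name in names:
        if name.startswith("language_model."):
            yield name
        else:
            side.append(name)


names = [
    "language_model.a",
    "vision_model.x",
    "language_model.b",
    "sound_encoder.y",
]

# A consumer that stops after the first LLM weight.
side_partial: List[str] = []
first = next(routed(names, side_partial))

# A consumer that iterates to the end.
side_full: List[str] = []
llm_seen = list(routed(names, side_full))

print(side_partial)  # nothing collected: the generator never ran past item 1
print(side_full)     # both non-LLM names collected
```

The side list is filled only as a by-product of someone pulling from the generator, so its completeness depends entirely on the consumer.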

Contributor Author

Fix

After self.language_model.load_weights(...) returns, the generator is now explicitly drained:

# Fully drain the generator so every mm tensor is buffered, even if
# the LLM loader stops iterating early.
llm_weights_iter = llm_weights_gen()
self.language_model.load_weights(llm_weights_iter)
for _ in llm_weights_iter:
    pass

Holding the generator in a named variable lets us iterate the remainder ourselves. Generators are stateful and resume from where the previous consumer left off, so this loop is a no-op when the LLM loader already consumed everything, and a safety net when it did not.
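A small standalone check of that behaviour, using hypothetical names (`routed`, `drain`) and `itertools.islice` as a stand-in for a loader that stops after one item:

```python
from itertools import islice
from typing import Iterator, List


def routed(names: List[str], side: List[str]) -> Iterator[str]:
    """Hypothetical router: yields LLM names, buffers the rest in `side`."""
    for name in names:
        if name.startswith("language_model."):
            yield name
        else:
            side.append(name)


def drain(it: Iterator[str]) -> None:
    """Consume whatever is left; does nothing if `it` is already exhausted."""
    for _ in it:
        pass


names = [
    "language_model.a",
    "vision_model.x",
    "language_model.b",
    "sound_encoder.y",
]

side: List[str] = []
it = routed(names, side)

seen = list(islice(it, 1))    # "loader" stops after one LLM item
before_drain = list(side)     # side list is still incomplete here

drain(it)                     # resumes exactly where the loader stopped
after_drain = list(side)

drain(it)                     # generator is exhausted: this is a no-op
after_second_drain = list(side)
```

Because the drain loop runs on the same generator object, it never revisits items the loader already consumed; it only finishes the tail.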

Why this resolves the concern

  • Removes the implicit dependency. Completeness of the multimodal buffers no longer relies on the LLM loader's iteration behavior. Whether it drains the iterable, stops after N items, or never iterates at all, the mm branches see every input tensor.
  • No new pre-allocation or buffering. LLM weights are still streamed lazily. The drain loop reads the same generator the LLM loader was reading, so there's no extra accumulation step and no change to peak memory.
  • Order of operations is preserved. mm components are still loaded after the LLM, on the same control-flow path. The only added work is finishing the iterator, which by construction has at most the remaining unprocessed input weights.

@DarkLight1337

Member

Please fix DCO

@DarkLight1337 DarkLight1337 added the verified label (Run pre-commit for new contributors without triggering other tests) on May 10, 2026
noa-neria added 3 commits May 11, 2026 12:40
Signed-off-by: Noa Neria <nneria@nvidia.com>
Signed-off-by: Noa Neria <nneria@nvidia.com>
Signed-off-by: Noa Neria <nneria@nvidia.com>
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) May 11, 2026 10:21
@github-actions github-actions Bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on May 11, 2026
@DarkLight1337 DarkLight1337 merged commit ac06214 into vllm-project:main May 11, 2026
61 checks passed