Avoid silent weights corruption when loading Nemotron Nano VL with reusable-buffer loaders like runai distributed streaming by noa-neria · Pull Request #42244 · vllm-project/vllm

noa-neria · 2026-05-10T17:34:53Z

Fixing bug #41749

Purpose

NemotronH_Nano_VL_V2.load_weights partitioned all checkpoint tensors into three lists (llm_weights, vision_weights, sound_weights) before loading any of them into model parameters. This is safe with the default loader. Loaders that reuse an internal buffer between iterations (e.g. runai_streamer in distributed mode) yield source tensors that share an underlying buffer, and that buffer is overwritten as iteration advances. Holding references to those tensors across the full partition pass therefore corrupts weights silently: later items overwrite the buffer backing earlier ones still sitting in the lists.
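The failure mode can be reproduced without any model code. The sketch below is an illustration only: `reusing_loader` is a hypothetical stand-in for a buffer-reusing loader (it is not the runai_streamer API), and a one-byte `bytearray` plays the role of the shared tensor storage.

```python
def reusing_loader(values):
    """Hypothetical loader that yields views over one shared buffer."""
    buf = bytearray(1)
    for name, value in values:
        buf[0] = value  # overwrites the data behind every earlier view
        yield name, memoryview(buf)


checkpoint = [("a", 1), ("b", 2), ("c", 3)]

# Partition first, read later: every retained view aliases the same buffer,
# so all of them show whatever was written last.
held = list(reusing_loader(checkpoint))
stale = {name: view[0] for name, view in held}

# Copy each item before the iterator advances: the values survive.
copied = {name: bytes(view)[0] for name, view in reusing_loader(checkpoint)}

print(stale)   # every entry reads 3
print(copied)  # entries read 1, 2, 3
```

No exception is raised in the stale case, which is why the corruption goes unnoticed.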

Change

vllm/model_executor/models/nano_nemotron_vl.py:

The LLM weights (~97% of the model) are now streamed through an inner generator consumed by self.language_model.load_weights(...). Each tensor is copied into its parameter before the iterator advances, so no stale reference is retained.
The smaller multimodal (mm) components (mlp1, vision_model, sound_encoder) are copied with detach().clone() into per-component buffer lists during the same single pass, then loaded after the LLM completes. Cloning makes them independent of any reusable streamer buffer.
Existing load_multimodal_weights gating (skip mm components when image/video/audio prompt limits are all 0) is preserved.
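As a rough illustration of the three points above, the following sketch routes a checkpoint in one pass over a buffer-reusing source. All names here (`reusing_loader`, `load_all`, the checkpoint keys) are hypothetical, and `bytes(w)` stands in for `w.detach().clone()`; it is not the code from nano_nemotron_vl.py.

```python
def reusing_loader(items):
    """Hypothetical loader whose yielded views share one buffer."""
    buf = bytearray(1)
    for name, value in items:
        buf[0] = value
        yield name, memoryview(buf)


def load_all(weights, load_multimodal=True):
    side = []  # buffered copies of the non-LLM tensors

    def llm_stream():
        for name, w in weights:
            if name.startswith("language_model."):
                # Strip the 'language_model.' prefix and stream lazily.
                yield name.split(".", 1)[1], w
            elif load_multimodal:
                # Copy now; the shared buffer is about to be overwritten.
                side.append((name, bytes(w)))

    llm_params = {}
    for name, w in llm_stream():
        llm_params[name] = w[0]  # consumed before the iterator advances
    side_params = {name: data[0] for name, data in side}
    return llm_params, side_params


ckpt = [
    ("language_model.q", 1),
    ("vision_model.p", 2),
    ("language_model.k", 3),
    ("mlp1.w", 4),
]
llm_params, side_params = load_all(reusing_loader(ckpt))
skipped_llm, skipped_side = load_all(reusing_loader(ckpt), load_multimodal=False)
```

With gating disabled, the multimodal entries are dropped during the pass and never copied, while the LLM stream is unaffected.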

No public API change.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request refactors the weight loading logic for the NanoNemotronVL model to use a generator for streaming LLM weights lazily, which prevents stale-reference corruption. Multimodal weights are now detached and cloned to maintain independence from reusable buffers, and tests have been updated with a mock tensor class to support these changes. Review feedback points out a fragile dependency where multimodal weights are only fully collected if the language model loader completely consumes the generator, suggesting that the generator should be explicitly exhausted to ensure all weights are loaded.

gemini-code-assist · 2026-05-10T17:39:32Z

+        def llm_weights_gen():
+            for name, w in weights:
+                if is_llm(name):
+                    # Strip 'language_model.' prefix for LLM weights
+                    yield ".".join(name.split(".")[1:]), w
+                elif is_adapter_weights((name, w)):
+                    if not load_multimodal_weights:
+                        continue
+                    trimmed_name = ".".join(name.split(".")[1:])
+                    adapter_weights.append((trimmed_name, w.detach().clone()))
+                elif is_vision_weights(name):
+                    if not load_multimodal_weights:
+                        continue
+                    # Convert: vision_model.radio_model.* → radio_model.*
+                    hf_key = name[len("vision_model.") :]
+                    vision_weights.append((hf_key, w.detach().clone()))
+                elif is_sound_weights(name):
+                    if not load_multimodal_weights:
+                        continue
+                    assert self.sound_encoder is not None
+                    sound_weights.append((name, w.detach().clone()))


The implementation of llm_weights_gen relies on the assumption that self.language_model.load_weights will fully consume the generator. If for any reason the language model's weight loader stops early (e.g., it only looks for a subset of weights), the multimodal weights (adapter_weights, vision_weights, sound_weights) will be only partially collected, leading to incomplete weight loading for those components. While the current AutoWeightsLoader in vLLM does consume the full iterable, this creates a fragile temporal coupling between the LLM loading phase and the multimodal collection phase. A more robust approach would be to ensure the generator is fully exhausted before proceeding to load multimodal components, or to explicitly document this dependency.
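The coupling described in this comment can be shown in isolation. In the sketch below, `route` and `early_stopping_loader` are hypothetical names, and the loader that returns after its first item is an assumption made purely to trigger the problem; it does not describe how AutoWeightsLoader behaves.

```python
def route(weights, side):
    """Yield LLM items; collect everything else into `side` as a side effect."""
    for name, w in weights:
        if name.startswith("language_model."):
            yield name, w
        else:
            side.append((name, w))


def early_stopping_loader(items):
    """Hypothetical consumer that returns after the first item."""
    for item in items:
        return [item]
    return []


ckpt = [
    ("language_model.a", 1),
    ("vision_model.b", 2),
    ("language_model.c", 3),
    ("sound_encoder.d", 4),
]

side = []
loaded = early_stopping_loader(route(ckpt, side))

# The generator is suspended right after its first yield, so the vision and
# sound entries were never reached and `side` is still empty.
print(loaded, side)
```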

Fix

After self.language_model.load_weights(...) returns, the generator is now explicitly drained:

# Fully drain the generator so every mm tensor is buffered, even if
# the LLM loader stops iterating early.
llm_weights_iter = llm_weights_gen()
self.language_model.load_weights(llm_weights_iter)
for _ in llm_weights_iter:
    pass

Holding the generator in a named variable lets us iterate the remainder ourselves. Generators are stateful and resume from where the previous consumer left off, so this loop is a no-op when the LLM loader already consumed everything, and a safety net when it did not.

Why this resolves the concern

Removes the implicit dependency. Mm-buffer completeness no longer relies on the LLM loader's iteration behavior. Whether it drains the iterable, stops after N items, or never iterates at all, the mm branches see every input tensor.

No new pre-allocation or buffering. LLM weights are still streamed lazily — the drain loop reads the same generator the LLM loader was reading, so there's no extra accumulation step and no change to peak memory.

Order of operations is preserved. mm components are still loaded after the LLM, on the same control-flow path. The only added work is finishing the iterator, which by construction visits at most the remaining unprocessed input weights.
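The three consumer behaviors named above (drains everything, stops after N items, never iterates) can be checked with a small stand-alone sketch. `route`, `load_with_drain`, and the three consumer functions are hypothetical helpers for illustration, not part of vLLM.

```python
def route(weights, side):
    """Yield LLM items; collect everything else into `side` as a side effect."""
    for name, w in weights:
        if name.startswith("language_model."):
            yield name, w
        else:
            side.append((name, w))


def load_with_drain(consumer, ckpt):
    side = []
    it = route(ckpt, side)
    consumer(it)
    # Generators resume where the last consumer stopped, so this loop is a
    # no-op after a full drain and a safety net otherwise.
    for _ in it:
        pass
    return side


def drains_fully(it):
    list(it)


def stops_after_one(it):
    next(it, None)


def never_iterates(it):
    pass


ckpt = [
    ("language_model.a", 1),
    ("vision_model.b", 2),
    ("language_model.c", 3),
    ("sound_encoder.d", 4),
]
results = [
    load_with_drain(consumer, ckpt)
    for consumer in (drains_fully, stops_after_one, never_iterates)
]
```

In all three cases the side list ends up holding both non-LLM entries, in checkpoint order.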

DarkLight1337 · 2026-05-10T23:55:26Z

Please fix DCO

Signed-off-by: Noa Neria <nneria@nvidia.com>

…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com>

…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

noa-neria requested a review from tomeras91 as a code owner May 10, 2026 17:34

claude Bot reviewed May 10, 2026

mergify Bot added the multi-modality Related to multi-modality (#4194) label May 10, 2026

gemini-code-assist Bot reviewed May 10, 2026

DarkLight1337 approved these changes May 10, 2026

DarkLight1337 added the verified Run pre-commit for new contributors without triggering other tests label May 10, 2026

noa-neria added 3 commits May 11, 2026 12:40

safe loading with runai streamer

ada17b7

Signed-off-by: Noa Neria <nneria@nvidia.com>

fix test

75cec3c

Signed-off-by: Noa Neria <nneria@nvidia.com>

ensure weights iterator is drained

0dbff7a

Signed-off-by: Noa Neria <nneria@nvidia.com>

noa-neria force-pushed the nemotron-loader branch from 6fbec03 to 0dbff7a Compare May 11, 2026 09:56

Merge branch 'main' into nemotron-loader

e05e73d

DarkLight1337 enabled auto-merge (squash) May 11, 2026 10:21

github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label May 11, 2026

DarkLight1337 merged commit ac06214 into vllm-project:main May 11, 2026
61 checks passed

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

Avoid silent weights corruption when loading Nemotron Nano VL with re…

c9d9864

…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com>

h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026

Avoid silent weights corruption when loading Nemotron Nano VL with re…

d303164

…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com>

noa-neria mentioned this pull request Jun 8, 2026

[Bugfix] Stream Llama4 weight loading to avoid host-OOM with copy-returning loaders #44645

Open

DarkLight1337 merged 4 commits into vllm-project:main from noa-neria:nemotron-loader
