Avoid silent weights corruption when loading Nemotron Nano VL with reusable-buffer loaders like runai distributed streaming#42244
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the weight loading logic for the NanoNemotronVL model to use a generator for streaming LLM weights lazily, which prevents stale-reference corruption. Multimodal weights are now detached and cloned to maintain independence from reusable buffers, and tests have been updated with a mock tensor class to support these changes. Review feedback points out a fragile dependency where multimodal weights are only fully collected if the language model loader completely consumes the generator, suggesting that the generator should be explicitly exhausted to ensure all weights are loaded.
| def llm_weights_gen(): | ||
| for name, w in weights: | ||
| if is_llm(name): | ||
| # Strip 'language_model.' prefix for LLM weights | ||
| yield ".".join(name.split(".")[1:]), w | ||
| elif is_adapter_weights((name, w)): | ||
| if not load_multimodal_weights: | ||
| continue | ||
| trimmed_name = ".".join(name.split(".")[1:]) | ||
| adapter_weights.append((trimmed_name, w.detach().clone())) | ||
| elif is_vision_weights(name): | ||
| if not load_multimodal_weights: | ||
| continue | ||
| # Convert: vision_model.radio_model.* → radio_model.* | ||
| hf_key = name[len("vision_model.") :] | ||
| vision_weights.append((hf_key, w.detach().clone())) | ||
| elif is_sound_weights(name): | ||
| if not load_multimodal_weights: | ||
| continue | ||
| assert self.sound_encoder is not None | ||
| sound_weights.append((name, w.detach().clone())) |
There was a problem hiding this comment.
The implementation of llm_weights_gen relies on the assumption that self.language_model.load_weights will fully consume the generator. If for any reason the language model's weight loader stops early (e.g., it only looks for a subset of weights), the multimodal weights (adapter_weights, vision_weights, sound_weights) will be only partially collected, leading to incomplete weight loading for those components. While the current AutoWeightsLoader in vLLM does consume the full iterable, this creates a fragile temporal coupling between the LLM loading phase and the multimodal collection phase. A more robust approach would be to ensure the generator is fully exhausted before proceeding to load multimodal components, or to explicitly document this dependency.
There was a problem hiding this comment.
Fix
After self.language_model.load_weights(...) returns, the generator is now explicitly drained:
# Fully drain the generator so every mm tensor is buffered, even if
# the LLM loader stops iterating early.
llm_weights_iter = llm_weights_gen()
self.language_model.load_weights(llm_weights_iter)
for _ in llm_weights_iter:
pass Holding the generator in a named variable lets us iterate the remainder ourselves. Generators are stateful and resume from where the previous consumer left off, so this loop is a no-op when the LLM loader already consumed everything, and a safety net when it did not.
Why this resolves the concern
- Removes the implicit dependency. Mm-buffer completeness no longer relies on the LLM loader's iteration behavior. Whether it drains the iterable, stops after N items, or never iterates at all, the mm branches see every input tensor.
- No new pre-allocation or buffering. LLM weights are still streamed lazily — the drain loop reads the same generator the LLM loader was reading, so there's no extra accumulation step and no change to peak memory.
- Order of operations is preserved. mm components are still loaded after the LLM, on the same control-flow path. The only added work is finishing the iterator, which by construction has at most the remaining unprocessed input weights.
|
Please fix DCO |
Signed-off-by: Noa Neria <nneria@nvidia.com>
Signed-off-by: Noa Neria <nneria@nvidia.com>
6fbec03 to
0dbff7a
Compare
…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com>
…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com>
…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com>
…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com>
…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…usable-buffer loaders like runai distributed streaming (vllm-project#42244) Signed-off-by: Noa Neria <nneria@nvidia.com>
Fixing bug #41749
Purpose
NemotronH_Nano_VL_V2.load_weightspartitioned all checkpoint tensors into three lists (llm_weights,vision_weights,sound_weights) before any of them was loaded into model parameters. This is safe with the default loader, but with loaders that reuse an internal buffer between iterations (e.g.runai_streamerin distributed mode), the source tensors share an underlying buffer that gets overwritten as iteration advances — so holding references to them across the full partition pass results in silent weight corruption, since later items overwrite the buffer backing earlier ones still sitting in the lists.Change
vllm/model_executor/models/nano_nemotron_vl.py:self.language_model.load_weights(...). Each tensor is copied into its parameter before the iterator advances, so no stale reference is retained.mlp1,vision_model,sound_encoder) aredetach().clone()d into per-component buffer lists during the same single pass, then loaded after the LLM completes. Cloning makes them independent of any reusable streamer buffer.load_multimodal_weightsgating (skip mm components when image/video/audio prompt limits are all 0) is preserved.No public API change.