📝 Walkthrough

This PR adds comprehensive Qwen3VL multimodal model support with decentralized process groups, including vision transformer blocks, context-parallel sequence packing, rotary embeddings, distributed training infrastructure, and example scripts for decentralized pretraining.

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (warnings)
Actionable comments posted: 11
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
src/megatron/bridge/training/utils/train_utils.py (1)
571-584: ⚠️ Potential issue | 🔴 Critical

Remove the unsupported `pg_collection` parameter from the `track_moe_metrics` call.

The `track_moe_metrics` function in megatron.core does not accept a `pg_collection` parameter. Passing it will cause a `TypeError` at runtime. Remove `pg_collection=pg_collection` from the call at line 583.

src/megatron/bridge/training/vlm_step.py (1)
145-192: 🛠️ Refactor suggestion | 🟠 Major

`if True:` creates dead code — the `else` branch is unreachable.

The `else` block (lines 159–192) will never execute. Per coding guidelines, dead code should be removed before merging unless there's a documented reason to keep it. If the non-PP padding path is being intentionally preserved for future use, add a comment explaining why.

Clean up: remove the dead else branch

```diff
 # When using pipeline parallelism, ensure fixed shapes equal to cfg.model.seq_length
-if True:
-    seq_len = cfg.model.seq_length
+seq_len = cfg.model.seq_length
-    tokens_or_input = batch.get("tokens") if batch.get("tokens") is not None else batch.get("input_ids")
-    ...
-else:
-    # No PP: pad sequence length to nearest multiple of 128 for efficiency (capped at model seq_length)
-    seq_cap = cfg.model.seq_length
-    ...
+tokens_or_input = batch.get("tokens") if batch.get("tokens") is not None else batch.get("input_ids")
+tokens_or_input = pad_or_truncate_2d_to_len(tokens_or_input, seq_len, seq_len, pad_value=0)
+...
```

As per coding guidelines: "If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it as debug code before merging".
tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_model.py (1)

314-323: ⚠️ Potential issue | 🔴 Critical

Pass `pg_collection` to `model_no_decoder` (line 314) and `model_no_pre` (line 366).

Both instances omit the `pg_collection` parameter, but the Qwen3VLModel constructor immediately accesses its attributes (`.cp`, `.tp`, `.pp`, `.embd`) without null checks. This will cause an `AttributeError` at runtime.

Lines 314–323:

```python
model_no_decoder = Qwen3VLModel(
    vision_transformer_config=vision_transformer_config,
    language_transformer_config=language_transformer_config,
    language_transformer_layer_spec=language_model_layer_spec,
    parallel_output=True,
    pre_process=True,
    post_process=True,
    add_encoder=True,
    add_decoder=False,
    pg_collection=pg_collection,
)
```

Also applies to: 366–375
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (2)

272-288: ⚠️ Potential issue | 🟡 Minor

Switch to `T | None` in the forward signature.

The forward parameters use implicit `Optional` defaults (e.g., `position_ids: torch.Tensor = None`). Use `torch.Tensor | None` to comply with typing guidelines for Python 3.10+.
77-112: ⚠️ Potential issue | 🔴 Critical

Require `pg_collection` or provide a default in the constructor.

`pg_collection` defaults to `None` (line 77) but is immediately accessed at lines 102–104 (`.cp`, `.tp`, `.pp`) and line 112 (`.embd`) without null checks. This will crash with `AttributeError: 'NoneType' object has no attribute ...` if `None` is passed or the default is relied upon. While line 166 later reassigns from `get_pg_collection(self)`, the constructor fails before reaching that line.

Either make the parameter required or initialize a default from `ProcessGroupCollection.use_mpu_process_groups()` before using it.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (1)

54-125: ⚠️ Potential issue | 🟠 Major

Guard `cp_group` before calling `.size()`.

The `cp_group` parameter defaults to `None`, but `forward()` unconditionally calls `self.cp_group.size()` at line 122. While current usage always provides a valid process group via `pg_collection.cp`, the parameter signature permits `None`. Add a guard or make `cp_group` required to prevent runtime errors.

Suggested fix:

```diff
-        self.cp_group = cp_group
+        if cp_group is None:
+            raise ValueError("cp_group must be provided for Qwen3VLMultimodalRotaryEmbedding")
+        self.cp_group = cp_group
 ...
-        if self.cp_group.size() > 1 and not self.is_thd_format:
+        if self.cp_group is not None and self.cp_group.size() > 1 and not self.is_thd_format:
```
🤖 Fix all issues with AI agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py`:
- Around line 16-42: The file uses a wildcard import from
megatron.core.transformer.attention and old typing forms; replace "from
megatron.core.transformer.attention import *" with explicit imports:
SelfAttention, BaseInferenceContext, PackedSeqParams,
deprecate_inference_params, nvtx_range_push, nvtx_range_pop, is_fa_min_version,
HAVE_FA3, and import Tensor from torch; then update the
Qwen3VLSelfAttention.forward signature and any related docstrings to modern PEP
604 and built-in generics syntax (e.g., Optional[Tensor] -> Tensor | None,
Union[Tensor, Tuple[Tensor, Tensor]] -> Tensor | tuple[Tensor, Tensor],
Tuple[Tensor, Tensor] -> tuple[Tensor, Tensor]) so symbols like forward,
SelfAttention, BaseInferenceContext, and PackedSeqParams use the new
annotations.
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py`:
- Line 60: The parameter pg_collection currently defaults to None but the code
later accesses pg_collection.cp unguarded, which will raise AttributeError; fix
by making pg_collection required (remove the default None from the parameter
declaration for pg_collection: ProcessGroupCollection) so callers must pass a
valid ProcessGroupCollection, or alternatively add a guard before any access to
pg_collection.cp (e.g., check if pg_collection is not None and handle the None
case or raise a clear ValueError), and update all usages in the
constructor/function where pg_collection is referenced (search for pg_collection
and pg_collection.cp around the constructor or methods at the top of
text_model.py) to ensure all accesses are protected or callers are updated.
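For the `pg_collection` findings above (here, and the parallel ones in model.py and transformer_block.py), the guard-or-default pattern can be sketched with stand-ins. `FakeProcessGroupCollection` below is hypothetical: it only mimics the attribute surface (`.cp`, `.tp`, `.pp`, `.embd`) the review says the constructors touch, standing in for megatron.core's real `ProcessGroupCollection`:

```python
class FakeProcessGroupCollection:
    """Hypothetical stand-in for megatron.core's ProcessGroupCollection."""

    def __init__(self):
        self.cp = "cp_group"
        self.tp = "tp_group"
        self.pp = "pp_group"
        self.embd = "embd_group"

    @classmethod
    def use_mpu_process_groups(cls):
        # Stand-in for the default-construction path named in the review.
        return cls()


class Model:
    def __init__(self, pg_collection=None):
        # Guard before dereferencing: fall back to a default collection
        # instead of crashing with AttributeError when None is passed.
        if pg_collection is None:
            pg_collection = FakeProcessGroupCollection.use_mpu_process_groups()
        self.cp_group = pg_collection.cp
        self.tp_group = pg_collection.tp
        self.pp_group = pg_collection.pp


m = Model()  # no AttributeError: the default collection is built lazily
assert m.cp_group == "cp_group"
```

The alternative fix — removing the `= None` default so callers must pass a collection — trades this fallback for a loud `TypeError` at construction time.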
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`:
- Around line 429-432: The loop uses the undefined symbol
sharded_state_dict_default and an incorrect identity comparison syntax; import
sharded_state_dict_default at the top of transformer_block.py (where other
sharded helpers are imported) and change the conditional from "if (not module is
self.layers) and (not module is self.deepstack_merger_list):" to use proper
identity checks "if (module is not self.layers) and (module is not
self.deepstack_merger_list):" so sharded_state_dict_default(module, ...)
resolves and lint errors E714/F821 are fixed.
- Around line 56-83: The constructor Qwen3VLVisionTransformerBlock.__init__
dereferences pg_collection into cp_group/tp_group/pp_group without null
checking; update the init to validate and handle a missing pg_collection by
either requiring it (remove the default None) or initializing it when None
(e.g., call ProcessGroupCollection.use_mpu_process_groups() or another factory)
and then assign self.cp_group/self.tp_group/self.pp_group, or raise a clear
ValueError if pg_collection is None; apply the same null-check/fix pattern to
Qwen3VLVisionModel.__init__ to prevent AttributeError.
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py`:
- Around line 137-149: The tp_comm_buffer_name for the second MLP linear layer
is wrong: in the build_module call that constructs self.linear_fc2 you should
change tp_comm_buffer_name from "patch_fc1" to a distinct name (e.g.
"patch_fc2") to match the corresponding first layer (self.linear_fc1) and avoid
comm buffer collisions; update the build_module invocation for linear_fc2 so
tp_comm_buffer_name="patch_fc2".
- Around line 365-384: In split_data_cp_rank, the early None guard is
unreachable because val.shape is accessed before checking val is None; move the
"if val is None: return val" to the top of the function (before any assertions
or val.shape accesses), then perform the existing assertions (assert cp_size > 1
and the seq length divisibility) and the rest of the logic (view, index_select,
reshape) unchanged; ensure you still default cp_rank via
mpu.get_context_parallel_rank() after the None check so behavior of
split_data_cp_rank remains identical for non-None tensors.
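The guard-ordering fix for `split_data_cp_rank` can be illustrated with a pure-Python stand-in. The real function operates on tensors with `view`/`index_select` and defaults `cp_rank` from `mpu.get_context_parallel_rank()`; the 2×cp_size chunk pairing shown here is the usual Megatron-style context-parallel load balancing, assumed for illustration rather than taken from the diff:

```python
def split_for_cp_rank(val, cp_size, cp_rank):
    # The None guard must come first, before any attribute access on val.
    if val is None:
        return None
    assert cp_size > 1
    seq_len = len(val)
    assert seq_len % (2 * cp_size) == 0, "sequence not divisible by 2*cp_size"
    chunk = seq_len // (2 * cp_size)
    chunks = [val[i * chunk:(i + 1) * chunk] for i in range(2 * cp_size)]
    # Rank r keeps chunks r and (2*cp_size - 1 - r) to balance causal load.
    return chunks[cp_rank] + chunks[2 * cp_size - 1 - cp_rank]

assert split_for_cp_rank(None, 2, 0) is None            # guard fires first
assert split_for_cp_rank(list(range(8)), 2, 0) == [0, 1, 6, 7]
assert split_for_cp_rank(list(range(8)), 2, 1) == [2, 3, 4, 5]
```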
- Line 493: The decorator is written as `@torch.no_grad` without parentheses,
which applies the `no_grad` class itself rather than an instance of the context
manager; update the decorator in modelling_qwen3_vl/utils.py to
`@torch.no_grad()` so the no-grad context is applied explicitly (the decorated
function is directly below the `@torch.no_grad` line).
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py`:
- Around line 35-51: The constructor __init__ of Qwen3VLVisionModel dereferences
pg_collection.tp without checking pg_collection (pg_collection defaults to
None), so either require pg_collection be passed or validate it: update __init__
to assert or raise a clear ValueError if pg_collection is None (e.g.,
"pg_collection must be provided"), and only set self.tp_group =
self.pg_collection.tp after that check (or remove the default None from the
pg_collection parameter to make it required); ensure the validation and
assignment occur before any use of self.tp_group.
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py`:
- Around line 195-211: The MoE bridge still uses ReplicatedMapping for the
vision module while the dense bridge switched to ConcatenatedQKVMapping for
vision QKV; update the MoE bridge to mirror those same mappings (add
ConcatenatedQKVMapping entries for
vision_model.decoder.layers.*.self_attention.linear_qkv.weight and .bias mapping
to model.visual.blocks.*.attn.qkv.weight and .bias, and keep ReplicatedMapping
for vision_model.patch_embed.proj.** and vision_model.pos_embed.**) or refactor
both bridges to import/share a common vision mapping block so the MoE conversion
uses the same QKV handling as the dense bridge.
In `@src/megatron/bridge/training/utils/packed_seq_utils.py`:
- Around line 86-92: The code computes seqlens_in_batch when attention_mask is
None but leaves attention_mask as None, which causes a TypeError later when
pre_process=True and the code indexes attention_mask[i]; to fix, when
attention_mask is None initialize a boolean mask (e.g., attention_mask =
torch.ones((batch_size, input_ids.shape[1]), dtype=torch.bool,
device=input_ids.device)) and keep seqlens_in_batch consistent (torch.int32
sums) so subsequent uses like attention_mask[i] and boolean indexing in the
pre-processing block work correctly; update the branch that creates
seqlens_in_batch (and the analogous 130–145 region) to materialize this mask
instead of leaving attention_mask as None.
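The mask-materialization fix can be sketched in miniature. The real code would build `torch.ones((batch, seq), dtype=torch.bool, device=input_ids.device)`; this stand-in uses plain lists only to show why the mask must exist before the per-sample `attention_mask[i]` indexing:

```python
def materialize_mask(input_ids, attention_mask=None):
    # If no mask is given, build an all-True mask so later per-sample
    # indexing (attention_mask[i]) works instead of raising TypeError.
    batch, seq = len(input_ids), len(input_ids[0])
    if attention_mask is None:
        attention_mask = [[True] * seq for _ in range(batch)]
    # Per-sample valid-token counts, consistent with the masked path.
    seqlens_in_batch = [sum(row) for row in attention_mask]
    return attention_mask, seqlens_in_batch

mask, seqlens = materialize_mask([[1, 2, 3], [4, 5, 6]])
assert seqlens == [3, 3]
assert mask[0] == [True, True, True]   # indexing no longer fails on None
```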
In `@src/megatron/bridge/training/vlm_step.py`:
- Around line 124-125: The code currently hard-sets is_first = True and is_last
= True and contains an unconditional "if True:" that makes the else branch
unreachable; update the VLM step so it either (a) restores correct per-rank
is_first/is_last logic (remove the hard-coded assignments and compute them from
pipeline parallel rank/size) and remove the "if True:" so the else branch is
used, or (b) if the hard-coding was intentional for CUDA graph support, leave
the behavior but remove the dead else branch and add a clear comment above the
assignments explaining: why all PP ranks must load labels/loss_mask (reference
get_batch_from_iterator where labels/loss_mask are loaded), why this bypasses
the encoder for CUDA graph, and which specific condition (e.g.,
CUDA_GRAPH_ENABLED) justifies the behavior; update the conditional around the
former "if True:" to a meaningful check or delete it and keep only the active
path so there is no unreachable code.
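Option (a) above — restoring per-rank flags — reduces to deriving both booleans from the pipeline rank. In Megatron these would come from the mpu pipeline-parallel rank/world-size helpers; they are abstracted here as plain arguments:

```python
def pp_stage_flags(pp_rank: int, pp_size: int) -> tuple[bool, bool]:
    # Derive first/last stage membership from the pipeline rank rather
    # than hard-coding both flags to True on every rank.
    return pp_rank == 0, pp_rank == pp_size - 1

assert pp_stage_flags(0, 4) == (True, False)   # first stage only
assert pp_stage_flags(3, 4) == (False, True)   # last stage only
assert pp_stage_flags(0, 1) == (True, True)    # no PP: both flags hold
```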
🟡 Minor comments (22)
src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py (1)

39-43: ⚠️ Potential issue | 🟡 Minor

Remove unused import `get_vision_model_config`.

Static analysis confirms `get_vision_model_config` (line 39) is imported but not used in this file. If it's needed by external consumers re-exporting from this module, add it to `__all__`; otherwise remove it.

Proposed fix (if not re-exported):

```diff
-from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config import get_vision_model_config
 from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.utils import PatchMergerSubmodules
```

examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py (1)

40-56: ⚠️ Potential issue | 🟡 Minor

Misleading comment: says "4B" but uses the 30B A3B config.

Line 42 says "Get the standard Qwen3 4B pretrain config" but calls `qwen3_vl_30b_a3b_pretrain_config`, which is for the 30B-A3B model.

Proposed fix:

```diff
 def main() -> None:
     """Run Qwen3 pretraining with decentralized process groups enabled."""
-    # Get the standard Qwen3 4B pretrain config with overrides
+    # Get the standard Qwen3 VL 30B A3B pretrain config with overrides
     cfg = qwen3_vl_30b_a3b_pretrain_config(
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (1)
15-21: ⚠️ Potential issue | 🟡 Minor

Remove the unused `torch` import.

Static analysis flags `torch` (line 20) as imported but unused. The file uses `torch.nn.functional as F` but not `torch` directly.

Proposed fix:

```diff
 from copy import deepcopy
 from dataclasses import dataclass, field
 from functools import partial
 from typing import List, Optional

-import torch
 import torch.nn.functional as F

 from megatron.core.transformer.transformer_config import TransformerConfig
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py (1)

242: ⚠️ Potential issue | 🟡 Minor

Typo: `"flaaten"` → `"flattened"`.

```diff
-    assert input_ids.dim() == 1, "input_ids should be flaaten"
+    assert input_ids.dim() == 1, "input_ids should be flattened"
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py (1)

348-349: ⚠️ Potential issue | 🟡 Minor

Replace `assert False` with `raise ValueError(...)` — assertions are removed under `python -O`.

As the static analysis tool notes, `assert False` can be silently stripped. Use an explicit exception for unreachable/invalid branches.

Fix:

```diff
 else:
-    assert False, f"should not have {token_id=}"
+    raise ValueError(f"Unexpected token_id={token_id}")
```

examples/recipes/qwen_vl/finetune_qwen_vl.py (1)
106-108: ⚠️ Potential issue | 🟡 Minor

Remove the unused `partial` import and fix import ordering.

Static analysis flags `functools.partial` as unused (F401). Additionally, the stdlib import on line 108 should be grouped with the other stdlib imports (lines 89–93) per the import ordering guidelines.

Fix:

```diff
 from megatron.bridge.training.qwen3vl_step import forward_step
 from megatron.bridge.utils.common_utils import get_rank_safe
-from functools import partial
```

As per coding guidelines: "Organize imports in order: future imports, standard library, third-party, first-party, local folder imports, separated by blank lines".

src/megatron/bridge/training/qwen3vl_step.py (1)

38: ⚠️ Potential issue | 🟡 Minor

Rename the module-level logger to `G_LOGGER`.

Global variables must use upper snake_case and be prefixed with `G`.

Suggested fix:

```diff
-logger = logging.getLogger(__name__)
+G_LOGGER = logging.getLogger(__name__)
```

As per coding guidelines: Use upper snake_case and prefix 'G' for global variables (e.g., G_MY_GLOBAL).
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py (1)

64-69: ⚠️ Potential issue | 🟡 Minor

Remove or justify the commented‑out code.
The commented parameter line provides no explanation. Either remove it or add a comment explaining why it is intentionally disabled.
As per coding guidelines: If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (1)

84-93: ⚠️ Potential issue | 🟡 Minor

Add an explicit error message for the unsupported path.

`NotImplementedError()` is raised without context, which makes failures opaque. Include a descriptive message.

Suggested fix:

```diff
-    else:
-        raise NotImplementedError()
+    else:
+        raise NotImplementedError("Qwen3VLVisionModel only supports pre_process=True.")
```

Based on learnings: In Python implementations, when a feature is not supported, raise an explicit error with a descriptive message.
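On the related `assert False` finding in utils.py above: the stripping behavior under `python -O` can be demonstrated directly with `compile(..., optimize=1)`, which applies the same optimization as the `-O` flag:

```python
src = (
    "def f(token_id):\n"
    "    assert False, f'should not have {token_id=}'\n"
    "    return None\n"
)

ns_debug, ns_opt = {}, {}
exec(compile(src, "<demo>", "exec", optimize=0), ns_debug)  # normal mode
exec(compile(src, "<demo>", "exec", optimize=1), ns_opt)    # mimics python -O

try:
    ns_debug["f"](7)
    raised = False
except AssertionError:
    raised = True
assert raised                    # normal mode: the assert fires
assert ns_opt["f"](7) is None    # -O mode: the assert is silently removed
```

A `raise ValueError(...)`, by contrast, survives both modes, which is why the review asks for it on unreachable branches.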
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)

185-188: ⚠️ Potential issue | 🟡 Minor

Remove or justify the commented‑out code.

This commented block lacks a rationale; remove it or explain why it must stay.

As per coding guidelines: If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)

136-149: ⚠️ Potential issue | 🟡 Minor

Use `print_rank_0` for rank‑safe logging.

Direct `print()` will spam logs across ranks. Please use `print_rank_0` (or a rank‑aware logger).

As per coding guidelines: Use 'print_rank_0' for logging in model bridge to avoid duplicate output across ranks.
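A minimal sketch of the rank-guarded pattern; the real `print_rank_0` in the bridge utilities presumably queries the distributed rank, which this hypothetical version approximates via the `RANK` environment variable:

```python
import io
import os
from contextlib import redirect_stdout

def print_rank_0(msg, rank=None):
    # Emit only on rank 0 so multi-process runs don't duplicate output.
    # Falling back to the RANK env var is an assumption for this sketch;
    # real code would ask torch.distributed for the rank.
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        print(msg)

buf = io.StringIO()
with redirect_stdout(buf):
    print_rank_0("hello", rank=0)
    print_rank_0("hello", rank=3)   # suppressed on non-zero ranks
assert buf.getvalue() == "hello\n"
```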
src/megatron/bridge/training/qwen3vl_step.py (1)

224: ⚠️ Potential issue | 🟡 Minor

Remove the unused `max_seqlen_in_batch`.

`max_seqlen_in_batch` is assigned but never used, which fails lint.

Suggested fix:

```diff
-    max_seqlen_in_batch = seqlens_in_batch.max().item()
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py (1)
126-127: ⚠️ Potential issue | 🟡 Minor

Remove the unnecessary f-string prefix. The string contains no placeholders, so the `f` is unnecessary.

Suggested fix:

```diff
-    raise ValueError(f"CUDA graphs must use flash decode with static batching!")
+    raise ValueError("CUDA graphs must use flash decode with static batching!")
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py (1)
363-365: ⚠️ Potential issue | 🟡 Minor

Use `dict | None` for nullable metadata.

Change `metadata: dict = None` to `metadata: dict | None = None` to conform with Python 3.10+ type hints per coding guidelines.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (1)
197-203: ⚠️ Potential issue | 🟡 Minor

Update type annotations to use `T | None` syntax instead of `Optional[T]`.

The `forward` method has three parameter annotations that violate the coding guideline:

- `hidden_states: Optional[torch.Tensor]` should be `torch.Tensor | None`
- `inference_params: Optional[InferenceParams] = None` should be `InferenceParams | None`
- `extra_block_kwargs: dict = None` should be `dict | None`

Per coding guidelines: Use 'T | None' for nullable types instead of 'Optional[T]'.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (1)

1-15: ⚠️ Potential issue | 🟡 Minor

Remove unused imports to satisfy lint.

Static analysis correctly flags multiple unused imports: `dataclass`, `Union`, `MultimodalProjector`, `MegatronModule`, `build_module`, and `get_tensor_model_parallel_group_if_none`. These should be removed while keeping `Optional` and `ModuleSpec`, which are actively used.

Suggested cleanup:

```diff
-from dataclasses import dataclass
-from typing import Optional, Union
+from typing import Optional
 ...
-from megatron.core.models.vision.multimodal_projector import MultimodalProjector
 ...
-from megatron.core.transformer.module import MegatronModule
-from megatron.core.transformer.spec_utils import ModuleSpec, build_module
-from megatron.core.utils import get_tensor_model_parallel_group_if_none
+from megatron.core.transformer.spec_utils import ModuleSpec
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)
15-19: ⚠️ Potential issue | 🟡 Minor

Remove the unused `mpu` import.

`mpu` is imported on line 17 but never used in the file.

Suggested fix:

```diff
-from megatron.core import InferenceParams, mpu, tensor_parallel
+from megatron.core import InferenceParams, tensor_parallel
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (1)
265-306: ⚠️ Potential issue | 🟡 Minor

Add a return type annotation and remove the unused parameter.

`apply_rotary_pos_emb_absolute` is missing a return type annotation (line 282). Additionally, the `cu_seqlens` parameter of `apply_rotary_pos_emb_thd_absolute` (line 265) is unused in the function body and should be removed, along with its reference at the call site (line 302).

Suggested fix:

```diff
 def apply_rotary_pos_emb_thd_absolute(
-    t: Tensor, cu_seqlens: Tensor, freqs: Tensor, rotary_interleaved: bool = False
+    t: Tensor, freqs: Tensor, rotary_interleaved: bool = False
 ) -> Tensor:
     ...
     return _apply_rotary_pos_emb_bshd(t[:, None], freqs, rotary_interleaved=rotary_interleaved).squeeze(1)

 def apply_rotary_pos_emb_absolute(
     t: Tensor,
     freqs: Tensor,
     config: Qwen3VLTransformerConfig,
     cu_seqlens: Optional[Tensor] = None,
-):
+) -> Tensor:
     ...
     if cu_seqlens is None:
         result = _apply_rotary_pos_emb_bshd(t, freqs, rotary_interleaved=config.rotary_interleaved)
     else:
-        result = apply_rotary_pos_emb_thd_absolute(t, cu_seqlens, freqs, rotary_interleaved=config.rotary_interleaved)
+        result = apply_rotary_pos_emb_thd_absolute(t, freqs, rotary_interleaved=config.rotary_interleaved)
     ...
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (1)
177-185: ⚠️ Potential issue | 🟡 Minor

Rename the loop variable to avoid shadowing the `input_ids` parameter.

The loop variable at line 177 shadows the outer `input_ids` parameter, creating a subtle bug. After the loop exits (line 241), `mrope_position_deltas = torch.tensor(mrope_position_deltas, device=input_ids.device)` uses the sliced tensor from the last iteration rather than the original parameter. Rename it to `sample_input_ids` and update all references within the loop.

Suggested fix:

```diff
-    for i, input_ids in enumerate(total_input_ids):
-        input_ids = input_ids[attention_mask[i] == 1]
+    for i, sample_input_ids in enumerate(total_input_ids):
+        sample_input_ids = sample_input_ids[attention_mask[i] == 1]
         image_nums, video_nums = 0, 0
-        vision_start_indices = torch.argwhere(input_ids == vision_start_token_id).squeeze(1)
-        vision_tokens = input_ids[vision_start_indices + 1]
+        vision_start_indices = torch.argwhere(sample_input_ids == vision_start_token_id).squeeze(1)
+        vision_tokens = sample_input_ids[vision_start_indices + 1]
         image_nums = (vision_tokens == image_token_id).sum()
         video_nums = (vision_tokens == video_token_id).sum()
-        input_tokens = input_ids.tolist()
+        input_tokens = sample_input_ids.tolist()
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (1)
16-55: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 and built‑in generics for type hints.

The code uses `List` and `Optional` from the `typing` module, which conflicts with Python 3.10+ conventions. Update to `list[...]` for generic sequences and `T | None` for nullable types, then remove the unused typing imports.

Suggested update:

```diff
-from typing import List, Optional
+from typing import TYPE_CHECKING
 ...
-    seq_len_interpolation_factor: Optional[float] = None,
+    seq_len_interpolation_factor: float | None = None,
 ...
-    mrope_section: List[int],
+    mrope_section: list[int],
 ...
-    input_ids: Optional[torch.LongTensor] = None,
+    input_ids: torch.LongTensor | None = None,
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (1)
140-187: ⚠️ Potential issue | 🟡 Minor

Add `strict=True` to all `zip()` calls and replace the unused `t` variable with `_`.

Ruff (B905) flags all three `zip()` calls for missing an explicit `strict=` parameter. Additionally, the `t` variable at line 140 is unused within the loop body and should be replaced with `_` (B007).

Suggested fix:

```diff
-    for t, h, w in zip(grid_ts, grid_hs, grid_ws):
+    for _, h, w in zip(grid_ts, grid_hs, grid_ws, strict=True):
 ...
-    patch_pos_embeds = patch_pos_embeds.split([h * w for h, w in zip(grid_hs, grid_ws)])
+    patch_pos_embeds = patch_pos_embeds.split([h * w for h, w in zip(grid_hs, grid_ws, strict=True)])
 ...
-    for pos_embed, t, h, w in zip(patch_pos_embeds, grid_ts, grid_hs, grid_ws):
+    for pos_embed, t, h, w in zip(patch_pos_embeds, grid_ts, grid_hs, grid_ws, strict=True):
```

src/megatron/bridge/training/qwen3vl_step.py (1)
41-48: ⚠️ Potential issue | 🟡 Minor

Prefix unused parameters with `_` to document intentional non-use.

The parameters `use_mtp` and `is_first_pp_stage` are not used in this function's implementation. Ruff correctly flags them as unused. Prefix them with `_` to comply with the coding guidelines.

Suggested fix:

```diff
 def get_batch_from_iterator(
     data_iterator: Iterable,
-    use_mtp: bool = False,
+    _use_mtp: bool = False,
     skip_getting_attention_mask_from_dataset: bool = True,
     *,
-    is_first_pp_stage: bool,
+    _is_first_pp_stage: bool,
     is_last_pp_stage: bool,
 ) -> dict[str, Any]:
```
🧹 Nitpick comments (6)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (1)

57-66: Commented-out code should include an explanation or be removed.

The commented-out block (lines 58–66) has a partial note about avoiding a deepcopy of `parallel_state`, but doesn't explain why this approach was abandoned. As per coding guidelines, commented-out code should either include a clear reason for being kept or be removed before merging.

Also, the function is missing a docstring and type annotations for `hf_config` and the return type.

Proposed improvement:

```diff
-def get_vision_model_config(config: Qwen3VLTransformerConfig, hf_config):
-    # # init config from scratch to avoid deepcopy of parallel_state
-    # config = Qwen3VLTransformerConfig(
-    #     num_layers=hf_config.depth,
-    #     hidden_size=hf_config.hidden_size,
-    #     num_attention_heads=hf_config.num_heads,
-    #     ffn_hidden_size=hf_config.intermediate_size,
-    #     add_bias_linear=True,
-    #     add_qkv_bias=True,
-    # )
+def get_vision_model_config(
+    config: Qwen3VLTransformerConfig, hf_config: "Qwen3VLVisionConfig"
+) -> Qwen3VLTransformerConfig:
+    """Configure a Qwen3VL transformer config for the vision encoder.
+
+    Mutates ``config`` in place to override MoE, parallelism, dropout,
+    and activation settings appropriate for the vision backbone.
+
+    Args:
+        config: The base transformer config to mutate.
+        hf_config: HuggingFace vision config providing patch sizes,
+            embeddings, and layer structure.
+
+    Returns:
+        The mutated ``config``.
+    """
     config.num_moe_experts = None
```

src/megatron/bridge/training/utils/packed_seq_utils.py (3)
15-19: Import ordering: `math` (stdlib) should come before `torch` (third-party).

Per coding guidelines, imports should be organized: future imports, standard library, third-party, first-party, separated by blank lines.

Proposed fix:

```diff
 from __future__ import annotations

+import math
+
 import torch
-import math

 from megatron.core.packed_seq_params import PackedSeqParams
```

73-74: Move `from megatron.core import mpu` to the top-level imports.

Module-level imports placed mid-file are harder to discover and violate standard import organization conventions. The "Copied from" comment is fine to keep as attribution.

Proposed fix — move to the import section at the top of the file:

```diff
+from megatron.core import mpu
 from megatron.core.packed_seq_params import PackedSeqParams
```

And remove lines 73-74 from the middle of the file.
77-171: Duplicate code with `src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py`.

`preprocess_packed_seqs` and `postprocess_packed_seqs` are nearly identical to the versions in `utils.py` (per relevant snippets). The only differences are the `use_fp8_padding` parameter and the `attention_mask is None` handling. Having two copies violates DRY and increases the risk of divergence over time.

Consider consolidating into a single shared implementation. If that's not feasible now, at minimum add a comment cross-referencing the other copy.

Also, `use_fp8_padding` on line 78 is missing a type annotation (`bool`).

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py (1)

16-17: Use `X | Y` and `T | None` instead of `Optional`/`Union` per coding guidelines.

`Optional` is used on line 103 and `Union` on lines 92–94. The coding guidelines require Python 3.10+ union syntax.

Example fix for PatchMergerSubmodules:

```diff
-from typing import Optional, Union
+from __future__ import annotations
 ...
 @dataclass
 class PatchMergerSubmodules:
-    patch_norm: Union[ModuleSpec, type] = None
-    linear_fc1: Union[ModuleSpec, type] = None
-    linear_fc2: Union[ModuleSpec, type] = None
+    patch_norm: ModuleSpec | type | None = None
+    linear_fc1: ModuleSpec | type | None = None
+    linear_fc2: ModuleSpec | type | None = None
```

As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use 'X | Y' for union types instead of 'Union[X, Y]'".

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)
21-21: UseT | Nonesyntax per coding guidelines.
Optional(line 21) and the implicitOptionalon line 60 should use Python 3.10+ union syntax.-from typing import Literal, Optional +from __future__ import annotations +from typing import Literal ... - pg_collection: ProcessGroupCollection = None, + pg_collection: ProcessGroupCollection | None = None,As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'".
Also applies to: 60-60
Quoted hunk from `src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py`:

```python
from megatron.core.transformer.attention import *

from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.rope import apply_rotary_pos_emb_absolute


class Qwen3VLSelfAttention(SelfAttention):
    """
    Overrides the SelfAttention class, the difference is that qwen3vl uses apply_rotary_pos_emb_absolute
    instead of apply_rotary_pos_emb
    """

    def forward(
        self,
        hidden_states: Tensor,
        attention_mask: Tensor,
        key_value_states: Optional[Tensor] = None,
        inference_context: Optional[BaseInferenceContext] = None,
        rotary_pos_emb: Optional[Union[Tensor, Tuple[Tensor, Tensor]]] = None,
        rotary_pos_cos: Optional[Tensor] = None,
        rotary_pos_sin: Optional[Tensor] = None,
        attention_bias: Optional[Tensor] = None,
        packed_seq_params: Optional[PackedSeqParams] = None,
        sequence_len_offset: Optional[int] = None,
        *,
        inference_params: Optional[BaseInferenceContext] = None,
        rotary_pos_cos_sin: Optional[Tensor] = None,
    ) -> Tuple[Tensor, Tensor]:
```
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:

# Check if the file exists and inspect its current state
find . -path "*qwen_vl/modelling_qwen3_vl/attention.py" -type f

🏁 Script executed:

# Show the file contents around the specified lines
find . -path "*qwen_vl/modelling_qwen3_vl/attention.py" -type f -exec wc -l {} \;

🏁 Script executed:

# Read the beginning of the file to see imports and the forward method
find . -path "*qwen_vl/modelling_qwen3_vl/attention.py" -type f -exec head -100 {} \;

🏁 Script executed:

# Find the megatron.core.transformer.attention module
find . -path "*/megatron/core/transformer/attention.py" -type f | head -1

🏁 Script executed:

# Read the megatron.core.transformer.attention module to see what's exported
find . -path "*/megatron/core/transformer/attention.py" -type f -exec head -150 {} \;

🏁 Script executed:

# Check if there's an __all__ defined in that module
find . -path "*/megatron/core/transformer/attention.py" -type f -exec grep -n "__all__" {} +

🏁 Script executed:

# Scan the full attention.py file to see all usages from the wildcard import
cat ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py | grep -E "^\s*(class|def|[A-Z][A-Za-z0-9_]*)" | head -50

🏁 Script executed:

# Search more carefully for the megatron core module
find . -type f -path "*/megatron/core/transformer/attention.py" 2>/dev/null

🏁 Script executed:

# List all files in megatron/core/transformer directory
find . -type d -name "megatron" 2>/dev/null | head -5

🏁 Script executed:

# Try searching in different ways
find . -type f -name "attention.py" | grep megatron | grep core

🏁 Script executed:

# Let's directly search the entire file for symbols that would come from the wildcard import
cat ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py | grep -oE "(SelfAttention|BaseInferenceContext|PackedSeqParams|deprecate_inference_params|nvtx_range_push|nvtx_range_pop|is_fa_min_version|HAVE_FA3)" | sort | uniq

🏁 Script executed:

# Get a comprehensive view of the entire file to see all references
cat ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py

🏁 Script executed:

# Verify the typing imports are being used correctly - check what Tensor is
rg "Tensor" ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py | head -20

🏁 Script executed:

# Verify Python version requirements and check if tuple vs Tuple is required
head -5 ./pyproject.toml 2>/dev/null || head -5 ./setup.py 2>/dev/null || grep -r "python_requires" . --include="*.py" | head -1

🏁 Script executed:

# Verify Tensor source - check if it's torch.Tensor
rg "from torch" ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/ || echo "No explicit torch import found"

🏁 Script executed:

# Check the coding guidelines to confirm tuple vs Tuple requirement
grep -A5 -B5 "built-in generics" ./CODING_GUIDELINES.md

🏁 Script executed:

# Verify the Python version requirement
cat ./pyproject.toml | grep -A10 "python"
Replace the wildcard import and update type annotations to use PEP 604 syntax.
The import * from megatron.core.transformer.attention violates linting rules (F403/F405), hides symbol sources, and makes the module fragile to upstream changes. Additionally, the function signature uses deprecated typing syntax (Optional, Union, Tuple) that conflicts with the project's Python 3.10+ requirement and coding guidelines, which mandate T | None, X | Y, and built-in generics (tuple, list, dict) respectively.
Replace the wildcard import with explicit symbols (SelfAttention, BaseInferenceContext, PackedSeqParams, deprecate_inference_params, nvtx_range_push, nvtx_range_pop, is_fa_min_version, HAVE_FA3) and import Tensor from torch. Update all type annotations in the forward method signature and docstring to use modern syntax:
- `Optional[Tensor]` → `Tensor | None`
- `Union[Tensor, Tuple[Tensor, Tensor]]` → `Tensor | tuple[Tensor, Tensor]`
- `Tuple[Tensor, Tensor]` → `tuple[Tensor, Tensor]`
Suggested update
-from megatron.core.transformer.attention import *
+from megatron.core.transformer.attention import (
+ SelfAttention,
+ BaseInferenceContext,
+ PackedSeqParams,
+ deprecate_inference_params,
+ nvtx_range_push,
+ nvtx_range_pop,
+ is_fa_min_version,
+ HAVE_FA3,
+)
+from torch import Tensor
from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.rope import apply_rotary_pos_emb_absolute
...
- key_value_states: Optional[Tensor] = None,
+ key_value_states: Tensor | None = None,
inference_context: Optional[BaseInferenceContext] = None,
- rotary_pos_emb: Optional[Union[Tensor, Tuple[Tensor, Tensor]]] = None,
+ rotary_pos_emb: Tensor | tuple[Tensor, Tensor] | None = None,
- rotary_pos_cos: Optional[Tensor] = None,
+ rotary_pos_cos: Tensor | None = None,
- rotary_pos_sin: Optional[Tensor] = None,
+ rotary_pos_sin: Tensor | None = None,
- attention_bias: Optional[Tensor] = None,
+ attention_bias: Tensor | None = None,
- packed_seq_params: Optional[PackedSeqParams] = None,
+ packed_seq_params: PackedSeqParams | None = None,
- sequence_len_offset: Optional[int] = None,
+ sequence_len_offset: int | None = None,
*,
- inference_params: Optional[BaseInferenceContext] = None,
+ inference_params: BaseInferenceContext | None = None,
- rotary_pos_cos_sin: Optional[Tensor] = None,
- ) -> Tuple[Tensor, Tensor]:
+ rotary_pos_cos_sin: Tensor | None = None,
+ ) -> tuple[Tensor, Tensor]:

🧰 Tools
🪛 Ruff (0.15.0)
[error] 16-16: from megatron.core.transformer.attention import * used; unable to detect undefined names
(F403)
[error] 21-21: SelfAttention may be undefined, or defined from star imports
(F405)
[error] 29-29: Tensor may be undefined, or defined from star imports
(F405)
[error] 30-30: Tensor may be undefined, or defined from star imports
(F405)
[error] 31-31: Optional may be undefined, or defined from star imports
(F405)
[error] 31-31: Tensor may be undefined, or defined from star imports
(F405)
[error] 32-32: Optional may be undefined, or defined from star imports
(F405)
[error] 32-32: BaseInferenceContext may be undefined, or defined from star imports
(F405)
[error] 33-33: Optional may be undefined, or defined from star imports
(F405)
[error] 33-33: Union may be undefined, or defined from star imports
(F405)
[error] 33-33: Tensor may be undefined, or defined from star imports
(F405)
[error] 33-33: Tuple may be undefined, or defined from star imports
(F405)
[error] 33-33: Tensor may be undefined, or defined from star imports
(F405)
[error] 33-33: Tensor may be undefined, or defined from star imports
(F405)
[error] 34-34: Optional may be undefined, or defined from star imports
(F405)
[error] 34-34: Tensor may be undefined, or defined from star imports
(F405)
[error] 35-35: Optional may be undefined, or defined from star imports
(F405)
[error] 35-35: Tensor may be undefined, or defined from star imports
(F405)
[error] 36-36: Optional may be undefined, or defined from star imports
(F405)
[error] 36-36: Tensor may be undefined, or defined from star imports
(F405)
[error] 37-37: Optional may be undefined, or defined from star imports
(F405)
[error] 37-37: PackedSeqParams may be undefined, or defined from star imports
(F405)
[error] 38-38: Optional may be undefined, or defined from star imports
(F405)
[error] 40-40: Optional may be undefined, or defined from star imports
(F405)
[error] 40-40: BaseInferenceContext may be undefined, or defined from star imports
(F405)
[warning] 41-41: Unused method argument: rotary_pos_cos_sin
(ARG002)
[error] 41-41: Optional may be undefined, or defined from star imports
(F405)
[error] 41-41: Tensor may be undefined, or defined from star imports
(F405)
[error] 42-42: Tuple may be undefined, or defined from star imports
(F405)
[error] 42-42: Tensor may be undefined, or defined from star imports
(F405)
[error] 42-42: Tensor may be undefined, or defined from star imports
(F405)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py` around
lines 16 - 42, The file uses a wildcard import from
megatron.core.transformer.attention and old typing forms; replace "from
megatron.core.transformer.attention import *" with explicit imports:
SelfAttention, BaseInferenceContext, PackedSeqParams,
deprecate_inference_params, nvtx_range_push, nvtx_range_pop, is_fa_min_version,
HAVE_FA3, and import Tensor from torch; then update the
Qwen3VLSelfAttention.forward signature and any related docstrings to modern PEP
604 and built-in generics syntax (e.g., Optional[Tensor] -> Tensor | None,
Union[Tensor, Tuple[Tensor, Tensor]] -> Tensor | tuple[Tensor, Tensor],
Tuple[Tensor, Tensor] -> tuple[Tensor, Tensor]) so symbols like forward,
SelfAttention, BaseInferenceContext, and PackedSeqParams use the new
annotations.
seq_len_interpolation_factor: Optional[float] = None,
mtp_block_spec: Optional[ModuleSpec] = None,
vp_stage: Optional[int] = None,
pg_collection: ProcessGroupCollection = None,
pg_collection.cp will crash with AttributeError when pg_collection is None.
Line 60 defaults pg_collection to None, but line 91 unconditionally accesses pg_collection.cp. Any caller that omits pg_collection will hit AttributeError: 'NoneType' object has no attribute 'cp'.
Either make pg_collection a required parameter (remove the default) or add a guard before accessing .cp.
Option A: Make it required (preferred if always needed)
- pg_collection: ProcessGroupCollection = None,
+ pg_collection: ProcessGroupCollection,

Option B: Guard the access

  self.rotary_pos_emb = Qwen3VLMultimodalRotaryEmbedding(
      kv_channels=self.config.kv_channels,
      rotary_percent=rotary_percent,
      rotary_interleaved=self.config.rotary_interleaved,
      seq_len_interpolation_factor=seq_len_interpolation_factor,
      rotary_base=rotary_base,
-     cp_group=pg_collection.cp,
+     cp_group=pg_collection.cp if pg_collection is not None else None,
  )

Also applies to: 84-92
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py` at line
60, The parameter pg_collection currently defaults to None but the code later
accesses pg_collection.cp unguarded, which will raise AttributeError; fix by
making pg_collection required (remove the default None from the parameter
declaration for pg_collection: ProcessGroupCollection) so callers must pass a
valid ProcessGroupCollection, or alternatively add a guard before any access to
pg_collection.cp (e.g., check if pg_collection is not None and handle the None
case or raise a clear ValueError), and update all usages in the
constructor/function where pg_collection is referenced (search for pg_collection
and pg_collection.cp around the constructor or methods at the top of
text_model.py) to ensure all accesses are protected or callers are updated.
class Qwen3VLVisionTransformerBlock(TransformerBlock):
    def __init__(
        self,
        config: Qwen3VLTransformerConfig,
        spec: Union[TransformerBlockSubmodules, ModuleSpec],
        post_layer_norm: bool = True,
        pre_process: bool = True,
        post_process: bool = True,
        # model_comm_pgs: ModelCommProcessGroups = None,
        vp_stage: Optional[int] = None,
        patch_merger_spec: ModuleSpec = None,
        pg_collection: ProcessGroupCollection = None,
    ):
        assert post_process and pre_process, "not support pp for deepstack_merger_list"
        super().__init__(
            config=config,
            spec=spec,
            post_layer_norm=post_layer_norm,
            pre_process=pre_process,
            post_process=post_process,
            # model_comm_pgs=model_comm_pgs,
            vp_stage=vp_stage,
            pg_collection=pg_collection,
        )
        self.pg_collection = pg_collection
        self.cp_group = pg_collection.cp
        self.tp_group = pg_collection.tp
        self.pp_group = pg_collection.pp
🧩 Analysis chain
🏁 Script executed:

# First, verify the file exists and read the relevant section
cat -n src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py | head -100

🏁 Script executed:

# Search for usages of Qwen3VLVisionTransformerBlock to see how it's instantiated
rg "Qwen3VLVisionTransformerBlock" --type py -B 2 -A 5

🏁 Script executed:

# Check if TransformerBlock has any validation or pg_collection handling
rg "class TransformerBlock" --type py -A 20

🏁 Script executed:

# Search for how pg_collection is used in similar patterns
rg "pg_collection\s*=\s*None" --type py

🏁 Script executed:

# Get the full instantiation context
rg "self.decoder = Qwen3VLVisionTransformerBlock" --type py -A 10

🏁 Script executed:

# Check TransformerBlock parent class from megatron.core
fd "transformer_block.py" --type f | grep -E "megatron.*core"

🏁 Script executed:

# Search for any validation of pg_collection in the codebase
rg "pg_collection.*is.*None|if pg_collection" --type py -B 2 -A 2

🏁 Script executed:

# Find the Qwen3VLVisionModel class and see how pg_collection is set
rg "class Qwen3VLVisionModel" --type py -A 50 | head -100

🏁 Script executed:

# Check if there's parent class initialization that sets pg_collection
rg "self.pg_collection\s*=" --type py -B 3 -A 1 src/megatron/bridge/models/qwen_vl/

🏁 Script executed:

# Look at the model_provider.py pattern to understand how pg_collection should be handled
cat -n src/megatron/bridge/models/model_provider.py | grep -A 10 "pg_collection is None"
Add validation for pg_collection parameter before dereferencing.
The pg_collection parameter defaults to None (line 67) but is dereferenced immediately at lines 81-83 without null checking (pg_collection.cp, .tp, .pp). This will raise AttributeError at runtime if the parameter is not provided. Either make pg_collection required (remove the default value), initialize it when None (e.g., via ProcessGroupCollection.use_mpu_process_groups()), or add explicit validation with a meaningful error message.
Note: The parent class Qwen3VLVisionModel has the same issue and should be fixed as well.
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`
around lines 56 - 83, The constructor Qwen3VLVisionTransformerBlock.__init__
dereferences pg_collection into cp_group/tp_group/pp_group without null
checking; update the init to validate and handle a missing pg_collection by
either requiring it (remove the default None) or initializing it when None
(e.g., call ProcessGroupCollection.use_mpu_process_groups() or another factory)
and then assign self.cp_group/self.tp_group/self.pp_group, or raise a clear
ValueError if pg_collection is None; apply the same null-check/fix pattern to
Qwen3VLVisionModel.__init__ to prevent AttributeError.
for name, module in self.named_children():
    if (not module is self.layers) and (not module is self.deepstack_merger_list):
        sharded_state_dict.update(
            sharded_state_dict_default(module, f"{prefix}{name}.", sharded_offsets, metadata)
Fix missing import and identity comparison.
sharded_state_dict_default is referenced but not imported (F821), and not module is should be module is not (E714). Both will fail lint/CI.
Suggested fix
-from megatron.core.dist_checkpointing.utils import replace_prefix_for_sharding
+from megatron.core.dist_checkpointing.utils import replace_prefix_for_sharding, sharded_state_dict_default
...
- for name, module in self.named_children():
- if (not module is self.layers) and (not module is self.deepstack_merger_list):
+ for name, module in self.named_children():
+ if module is not self.layers and module is not self.deepstack_merger_list:

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for name, module in self.named_children():
    if module is not self.layers and module is not self.deepstack_merger_list:
        sharded_state_dict.update(
            sharded_state_dict_default(module, f"{prefix}{name}.", sharded_offsets, metadata)
🧰 Tools
🪛 Flake8 (7.3.0)
[error] 432-432: undefined name 'sharded_state_dict_default'
(F821)
🪛 Ruff (0.15.0)
[error] 430-430: Test for object identity should be is not
Convert to is not
(E714)
[error] 430-430: Test for object identity should be is not
Convert to is not
(E714)
[error] 432-432: Undefined name sharded_state_dict_default
(F821)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`
around lines 429 - 432, The loop uses the undefined symbol
sharded_state_dict_default and an incorrect identity comparison syntax; import
sharded_state_dict_default at the top of transformer_block.py (where other
sharded helpers are imported) and change the conditional from "if (not module is
self.layers) and (not module is self.deepstack_merger_list):" to use proper
identity checks "if (module is not self.layers) and (module is not
self.deepstack_merger_list):" so sharded_state_dict_default(module, ...)
resolves and lint errors E714/F821 are fixed.
self.linear_fc2 = build_module(
    submodules.linear_fc2,
    self.hidden_size,
    self.config.out_hidden_size,
    config=self.config,
    init_method=self.config.output_layer_init_method,
    bias=self.config.add_bias_linear,
    input_is_parallel=True,
    skip_bias_add=False,
    is_expert=False,
    tp_comm_buffer_name="patch_fc1",
    tp_group=tp_group,
)
tp_comm_buffer_name for linear_fc2 is "patch_fc1" — likely a copy-paste error.
Both linear_fc1 (line 131) and linear_fc2 (line 147) use the same buffer name "patch_fc1". The second linear should use a distinct name like "patch_fc2" to avoid communication buffer collisions.
Fix
  self.linear_fc2 = build_module(
      submodules.linear_fc2,
      self.hidden_size,
      self.config.out_hidden_size,
      config=self.config,
      init_method=self.config.output_layer_init_method,
      bias=self.config.add_bias_linear,
      input_is_parallel=True,
      skip_bias_add=False,
      is_expert=False,
-     tp_comm_buffer_name="patch_fc1",
+     tp_comm_buffer_name="patch_fc2",
      tp_group=tp_group,
  )

📝 Committable suggestion
self.linear_fc2 = build_module(
    submodules.linear_fc2,
    self.hidden_size,
    self.config.out_hidden_size,
    config=self.config,
    init_method=self.config.output_layer_init_method,
    bias=self.config.add_bias_linear,
    input_is_parallel=True,
    skip_bias_add=False,
    is_expert=False,
    tp_comm_buffer_name="patch_fc2",
    tp_group=tp_group,
)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py` around lines
137 - 149, The tp_comm_buffer_name for the second MLP linear layer is wrong: in
the build_module call that constructs self.linear_fc2 you should change
tp_comm_buffer_name from "patch_fc1" to a distinct name (e.g. "patch_fc2") to
match the corresponding first layer (self.linear_fc1) and avoid comm buffer
collisions; update the build_module invocation for linear_fc2 so
tp_comm_buffer_name="patch_fc2".
return new_pixel_values, new_image_grid_thws, cp_img_num, images_padded


@torch.no_grad
@torch.no_grad missing parentheses — should be @torch.no_grad().
Without (), torch.no_grad receives the decorated function as its argument rather than acting as a context-managed decorator. This will likely silently produce unexpected behavior or a runtime error.
Fix
-@torch.no_grad
+@torch.no_grad()
 def qwen3vl_cp_split(

📝 Committable suggestion
@torch.no_grad()
def qwen3vl_cp_split(
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py` at line 493,
the decorator is written as `@torch.no_grad` without calling it, which passes the
function into no_grad instead of applying the context manager; update the
decorator to `@torch.no_grad()` so the no-grad context is correctly applied
(find the decorated function directly below the `@torch.no_grad` line and
replace the decorator with `@torch.no_grad()`).
def __init__(
    self,
    transformer_config: Qwen3VLTransformerConfig,
    transformer_layer_spec: ModuleSpec,
    patch_merger_spec: ModuleSpec,
    pre_process: bool = True,
    post_process: bool = True,
    pg_collection: ProcessGroupCollection = None,
) -> None:
    assert post_process and pre_process, "not support pp for deepstack_merger_list"
    super().__init__(config=transformer_config)
    self.spatial_merge_size = transformer_config.spatial_merge_size
    self.patch_size = transformer_config.patch_size
    self.spatial_merge_unit = self.spatial_merge_size * self.spatial_merge_size
    self.pg_collection = pg_collection
    self.tp_group = self.pg_collection.tp
🧩 Analysis chain
🏁 Script executed:

# First, examine the file to confirm the constructor signature
cat -n src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py | head -60

🏁 Script executed:

# Search for all instantiations of Qwen3VLVisionModel
rg -n "Qwen3VLVisionModel\s*\(" -g '*.py' -A 5

🏁 Script executed:

# Also search for how pg_collection is typically passed in this codebase
rg -n "pg_collection\s*=" -g '*.py' -B 3 -A 1 | head -100

🏁 Script executed:

sed -n '125,145p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py

🏁 Script executed:

# Check the context around line 128 in model.py to see where pg_collection comes from
sed -n '100,135p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py

🏁 Script executed:

# Also check the function signature that contains this call
grep -B 20 "self.vision_model = Qwen3VLVisionModel" src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py | head -30

🏁 Script executed:

# Check if there are any other calls to Qwen3VLVisionModel beyond model.py
rg -n "Qwen3VLVisionModel\(" -g '*.py' -B 2 -A 6
Validate pg_collection before dereferencing.
pg_collection defaults to None on line 42, but line 50 immediately accesses .tp without checking. If a caller instantiates Qwen3VLVisionModel without passing pg_collection, this will crash at construction time. Make the parameter required or provide a safe default.
Suggested fix
- self.pg_collection = pg_collection
- self.tp_group = self.pg_collection.tp
+ if pg_collection is None:
+ raise ValueError("pg_collection must be provided for Qwen3VLVisionModel")
+ self.pg_collection = pg_collection
+ self.tp_group = self.pg_collection.tp

🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py` around
lines 35 - 51, The constructor __init__ of Qwen3VLVisionModel dereferences
pg_collection.tp without checking pg_collection (pg_collection defaults to
None), so either require pg_collection be passed or validate it: update __init__
to assert or raise a clear ValueError if pg_collection is None (e.g.,
"pg_collection must be provided"), and only set self.tp_group =
self.pg_collection.tp after that check (or remove the default None from the
pg_collection parameter to make it required); ensure the validation and
assignment occur before any use of self.tp_group.
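The fail-fast pattern this prompt describes can be sketched without Megatron; `ProcessGroupCollection` and `resolve_tp_group` here are minimal stand-ins for illustration, not the real classes:

```python
class ProcessGroupCollection:
    """Minimal stand-in with the attributes the vision model reads."""

    def __init__(self, tp=None, cp=None, pp=None):
        self.tp, self.cp, self.pp = tp, cp, pp


def resolve_tp_group(pg_collection):
    """Raise a clear error up front instead of a later AttributeError on None."""
    if pg_collection is None:
        raise ValueError("pg_collection must be provided for Qwen3VLVisionModel")
    return pg_collection.tp


print(resolve_tp_group(ProcessGroupCollection(tp="tp-group-0")))  # tp-group-0

try:
    resolve_tp_group(None)
except ValueError as e:
    print(e)  # pg_collection must be provided for Qwen3VLVisionModel
```

The point of the explicit check is that the error names the missing argument at construction time, rather than surfacing as `'NoneType' object has no attribute 'tp'` deep inside the forward pass.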
# QKV mapping for vision model
ConcatenatedQKVMapping(
    megatron_param="vision_model.decoder.layers.*.self_attention.linear_qkv.weight",
    hf_param="model.visual.blocks.*.attn.qkv.weight",
),
ConcatenatedQKVMapping(
    megatron_param="vision_model.decoder.layers.*.self_attention.linear_qkv.bias",
    hf_param="model.visual.blocks.*.attn.qkv.bias",
),
ReplicatedMapping(
-   megatron_param="vision_model.**",
-   hf_param="model.visual.**",
+   megatron_param="vision_model.patch_embed.proj.**",
+   hf_param="model.visual.patch_embed.proj.**",
),
ReplicatedMapping(
    megatron_param="vision_model.pos_embed.**",
    hf_param="model.visual.pos_embed.**",
),
Mirror the vision QKV mapping in the MoE bridge.
The dense bridge now uses ConcatenatedQKVMapping for vision QKV, but the MoE bridge still replicates the entire vision module. If the vision QKV shapes differ (as implied by this change), MoE conversions will still be incorrect. Consider applying the same ConcatenatedQKVMapping entries (or refactor to share a common vision mapping block).
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py` around lines 195 -
211, The MoE bridge still uses ReplicatedMapping for the vision module while the
dense bridge switched to ConcatenatedQKVMapping for vision QKV; update the MoE
bridge to mirror those same mappings (add ConcatenatedQKVMapping entries for
vision_model.decoder.layers.*.self_attention.linear_qkv.weight and .bias mapping
to model.visual.blocks.*.attn.qkv.weight and .bias, and keep ReplicatedMapping
for vision_model.patch_embed.proj.** and vision_model.pos_embed.**) or refactor
both bridges to import/share a common vision mapping block so the MoE conversion
uses the same QKV handling as the dense bridge.
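One way to keep the two bridges in sync is a shared factory for the vision mappings. The dataclasses below are simplified stand-ins for the bridge's real mapping classes, and `vision_mappings` is a hypothetical helper name:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConcatenatedQKVMapping:  # stand-in for the bridge's mapping class
    megatron_param: str
    hf_param: str


@dataclass(frozen=True)
class ReplicatedMapping:  # stand-in
    megatron_param: str
    hf_param: str


def vision_mappings():
    """Single source of truth consumed by both the dense and MoE bridges."""
    return [
        ConcatenatedQKVMapping(
            "vision_model.decoder.layers.*.self_attention.linear_qkv.weight",
            "model.visual.blocks.*.attn.qkv.weight",
        ),
        ConcatenatedQKVMapping(
            "vision_model.decoder.layers.*.self_attention.linear_qkv.bias",
            "model.visual.blocks.*.attn.qkv.bias",
        ),
        ReplicatedMapping("vision_model.patch_embed.proj.**", "model.visual.patch_embed.proj.**"),
        ReplicatedMapping("vision_model.pos_embed.**", "model.visual.pos_embed.**"),
    ]


dense_mappings = [*vision_mappings()]  # dense bridge consumes the shared block
moe_mappings = [*vision_mappings()]    # MoE bridge gets identical vision handling
print(len(dense_mappings))  # 4
```

With one factory, a future change to the vision QKV layout cannot drift between the two bridges.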
batch_size = input_ids.shape[0]

seqlens_in_batch = (
    attention_mask.sum(dim=-1, dtype=torch.int32)
    if attention_mask is not None
    else torch.ones((batch_size), dtype=torch.int32, device=input_ids.device) * input_ids.shape[1]
)
Potential NoneType indexing error when attention_mask is None and pre_process=True.
Lines 88–92 handle a None attention_mask by computing seqlens_in_batch as all-ones, but attention_mask itself remains None. When pre_process=True, lines 137 and 145 use attention_mask[i] as a boolean index into input_ids, which will raise a TypeError.
Either guard the pre-processing block against None, or materialize a boolean mask when attention_mask is None:
Proposed fix
seqlens_in_batch = (
attention_mask.sum(dim=-1, dtype=torch.int32)
if attention_mask is not None
else torch.ones((batch_size), dtype=torch.int32, device=input_ids.device) * input_ids.shape[1]
)
+ if attention_mask is None:
+ attention_mask = torch.ones(input_ids.shape[:2], dtype=torch.bool, device=input_ids.device)
tp_size = mpu.get_tensor_model_parallel_world_size()Also applies to: 130-145
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/utils/packed_seq_utils.py` around lines 86 - 92,
The code computes seqlens_in_batch when attention_mask is None but leaves
attention_mask as None, which causes a TypeError later when pre_process=True and
the code indexes attention_mask[i]; to fix, when attention_mask is None
initialize a boolean mask (e.g., attention_mask = torch.ones((batch_size,
input_ids.shape[1]), dtype=torch.bool, device=input_ids.device)) and keep
seqlens_in_batch consistent (torch.int32 sums) so subsequent uses like
attention_mask[i] and boolean indexing in the pre-processing block work
correctly; update the branch that creates seqlens_in_batch (and the analogous
130–145 region) to materialize this mask instead of leaving attention_mask as
None.
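The materialize-when-None idea, stripped down to plain Python with lists standing in for tensors (`seqlens_and_mask` is an invented name for illustration):

```python
def seqlens_and_mask(input_ids, attention_mask=None):
    """Return per-sample sequence lengths, creating a full all-ones mask
    when none is supplied so later boolean indexing never sees None."""
    batch_size, seq_len = len(input_ids), len(input_ids[0])
    if attention_mask is None:
        attention_mask = [[1] * seq_len for _ in range(batch_size)]
    seqlens_in_batch = [sum(row) for row in attention_mask]
    return seqlens_in_batch, attention_mask


# No mask: every token counts, and a usable mask comes back.
print(seqlens_and_mask([[5, 6, 7], [8, 9, 0]]))  # ([3, 3], [[1, 1, 1], [1, 1, 1]])

# Explicit mask: padded positions are excluded from the lengths.
print(seqlens_and_mask([[5, 6, 7], [8, 9, 0]], [[1, 1, 0], [1, 1, 1]])[0])  # [2, 3]
```

Returning the materialized mask alongside the lengths is what keeps the downstream `attention_mask[i]` indexing path safe in both branches.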
is_first = True
is_last = True
🧩 Analysis chain
🏁 Script executed:

#!/bin/bash
# First, let's find and examine the vlm_step.py file
find . -name "vlm_step.py" -type f

🏁 Script executed:

#!/bin/bash
# Read the relevant section of vlm_step.py (lines 120-200)
FILE="src/megatron/bridge/training/vlm_step.py"
if [ -f "$FILE" ]; then
  echo "=== Lines 120-200 of $FILE ==="
  sed -n '120,200p' "$FILE" | cat -n
else
  echo "File not found at $FILE"
  # Try to find it
  find . -name "vlm_step.py" 2>/dev/null | head -5
fi

🏁 Script executed:

#!/bin/bash
# Check for PP stage variables in the file
FILE="src/megatron/bridge/training/vlm_step.py"
if [ -f "$FILE" ]; then
  echo "=== Searching for is_pp_first_stage / is_pp_last_stage ==="
  rg "is_pp_first_stage|is_pp_last_stage" "$FILE" -n
  echo ""
  echo "=== Searching for if True ==="
  rg "if True" "$FILE" -n
fi

🏁 Script executed:

#!/bin/bash
# Check where vlm_forward_step and related functions are used
rg "vlm_forward_step" --type=py -l | head -10

🏁 Script executed:

#!/bin/bash
# Read more context around the hard-coded values and check for comments
FILE="src/megatron/bridge/training/vlm_step.py"
sed -n '100,130p' "$FILE" | cat -n

🏁 Script executed:

#!/bin/bash
# Check if there are any comments explaining the PP behavior or hard-coding
FILE="src/megatron/bridge/training/vlm_step.py"
sed -n '1,150p' "$FILE" | cat -n | tail -70

🏁 Script executed:

#!/bin/bash
# Search for any recent changes or comments about CUDA graphs or PP behavior
rg -i "cuda.*graph|cuda graph|first.*last.*pp|pipeline.*parallel" src/megatron/bridge/training/vlm_step.py -B2 -A2

🏁 Script executed:

#!/bin/bash
# Check the get_batch_from_iterator function to see how is_first and is_last are used
rg -n "def get_batch_from_iterator" --type=py

🏁 Script executed:

#!/bin/bash
# Check if there's a git log showing when these values changed from dynamic to hard-coded
if [ -d ".git" ]; then
  git log --oneline -20 -- src/megatron/bridge/training/vlm_step.py
fi

🏁 Script executed:

#!/bin/bash
# Check what the else branch (lines 159-192) contains and verify indentation
FILE="src/megatron/bridge/training/vlm_step.py"
sed -n '145,195p' "$FILE" | cat -n

🏁 Script executed:

#!/bin/bash
# Check the full context of get_batch_from_iterator to understand what is_first/is_last control
FILE="src/megatron/bridge/training/vlm_step.py"
sed -n '41,120p' "$FILE" | cat -n
Remove dead code (`if True:`) and document the CUDA graph requirement for hard-coded PP stage values.
Lines 124–125 hard-code `is_first = True` and `is_last = True`, forcing all PP ranks to load `labels` and `loss_mask` from the data iterator (see `get_batch_from_iterator`, lines 36–37). The `if True:` at line 145 makes the else branch (lines 160–207) unreachable dead code; per coding guidelines, remove this dead branch and replace `if True:` with a proper condition, or remove the conditional entirely.
If this hard-coding is intentional for CUDA graph support (per commit "by pass encoder for cuda graph"), add a comment explaining why all PP stages must load `labels`/`loss_mask` even though only the last PP stage needs them for loss computation. This clarifies the behavioral change for PP > 1 configurations.
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/vlm_step.py` around lines 124–125, the code currently hard-sets `is_first = True` and `is_last = True` and contains an unconditional `if True:` that makes the else branch unreachable. Update the VLM step so it either
(a) restores correct per-rank `is_first`/`is_last` logic (remove the hard-coded assignments and compute them from the pipeline-parallel rank and size) and removes the `if True:` so the else branch is used, or
(b) if the hard-coding was intentional for CUDA graph support, keeps the behavior but removes the dead else branch and adds a clear comment above the assignments explaining: why all PP ranks must load labels/loss_mask (reference `get_batch_from_iterator`, where labels/loss_mask are loaded), why this bypasses the encoder for CUDA graphs, and which specific condition (e.g., `CUDA_GRAPH_ENABLED`) justifies the behavior.
Either way, replace the former `if True:` with a meaningful check or delete it, keeping only the active path so there is no unreachable code.
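Option (a) above — deriving the stage flags from the pipeline-parallel topology instead of hard-coding them — could look like the following minimal sketch. This uses a plain-Python stand-in helper; in Megatron these flags would normally come from `megatron.core.parallel_state.is_pipeline_first_stage()` / `is_pipeline_last_stage()` rather than a hand-rolled function:

```python
def pp_stage_flags(pp_rank: int, pp_world_size: int) -> tuple[bool, bool]:
    """Return (is_first, is_last) for one pipeline-parallel rank.

    Stand-in for the megatron.core.parallel_state helpers. With
    pp_world_size == 1 both flags are True, which matches the current
    hard-coded behavior; with PP > 1 only the last stage loads
    labels/loss_mask for loss computation.
    """
    return pp_rank == 0, pp_rank == pp_world_size - 1


# Example: with no pipeline parallelism, every rank is both first and last.
is_first, is_last = pp_stage_flags(pp_rank=0, pp_world_size=1)
```

If the CUDA-graph path genuinely requires all stages to load the tensors, the helper can still be called and its result deliberately overridden under a named flag, which keeps the intent readable.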
Force-pushed 1be02af to 4c4ef34
What does this PR do?
This PR is used with MCore PR 3137 to pass pg_collection to CudaGraphManager and TECudaGraphHelper.
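The wiring being described can be sketched roughly as follows. `ProcessGroupCollection` and `CudaGraphManagerSketch` here are hypothetical stand-ins for illustration only — the actual MCore `CudaGraphManager`/`TECudaGraphHelper` signatures in PR 3137 may differ:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ProcessGroupCollection:
    """Hypothetical container for named process groups (tp, pp, dp, ...)."""

    groups: dict[str, Any] = field(default_factory=dict)

    def get(self, name: str) -> Any:
        return self.groups.get(name)


class CudaGraphManagerSketch:
    """Sketch of a graph-capture helper that accepts an explicit
    pg_collection instead of reading global parallel state, so that
    decentralized jobs can supply their own process groups."""

    def __init__(self, pg_collection: Optional[ProcessGroupCollection] = None):
        # None keeps backward compatibility: fall back to an empty collection
        # here (a real implementation would fall back to the global default
        # process groups instead).
        self.pg_collection = pg_collection or ProcessGroupCollection()
```

The design point is simply that the process groups become an explicit constructor argument with a backward-compatible default, rather than an implicit global.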
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Enhancements
Documentation