📝 Walkthrough

This PR adds comprehensive Qwen3VL multimodal model support with decentralized process groups, including vision transformer blocks, context-parallel sequence packing, rotary embeddings, distributed training infrastructure, and example scripts for decentralized pretraining.

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (warnings)
Actionable comments posted: 11
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
src/megatron/bridge/training/utils/train_utils.py (1)
571-584: ⚠️ Potential issue | 🔴 Critical

Remove the unsupported `pg_collection` parameter from the `track_moe_metrics` call.

The `track_moe_metrics` function in megatron.core does not accept a `pg_collection` parameter. Passing it will cause a `TypeError` at runtime. Remove `pg_collection=pg_collection` from the call at line 583.

src/megatron/bridge/training/vlm_step.py (1)
145-192: 🛠️ Refactor suggestion | 🟠 Major

`if True:` creates dead code — the `else` branch is unreachable.

The `else` block (lines 159–192) will never execute. Per coding guidelines, dead code should be removed before merging unless there's a documented reason to keep it. If the non-PP padding path is being intentionally preserved for future use, add a comment explaining why.

Clean up: remove the dead else branch

```diff
 # When using pipeline parallelism, ensure fixed shapes equal to cfg.model.seq_length
-if True:
-    seq_len = cfg.model.seq_length
+seq_len = cfg.model.seq_length
-    tokens_or_input = batch.get("tokens") if batch.get("tokens") is not None else batch.get("input_ids")
-    ...
-else:
-    # No PP: pad sequence length to nearest multiple of 128 for efficiency (capped at model seq_length)
-    seq_cap = cfg.model.seq_length
-    ...
+tokens_or_input = batch.get("tokens") if batch.get("tokens") is not None else batch.get("input_ids")
+tokens_or_input = pad_or_truncate_2d_to_len(tokens_or_input, seq_len, seq_len, pad_value=0)
+...
```

As per coding guidelines: "If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it as debug code before merging".
tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_model.py (1)

314-323: ⚠️ Potential issue | 🔴 Critical

Pass `pg_collection` to `model_no_decoder` (line 314) and `model_no_pre` (line 366).

Both instances omit the `pg_collection` parameter, but the Qwen3VLModel constructor immediately accesses its attributes (`.cp`, `.tp`, `.pp`, `.embd`) without null checks. This will cause an `AttributeError` at runtime.

Lines 314–323:

```python
model_no_decoder = Qwen3VLModel(
    vision_transformer_config=vision_transformer_config,
    language_transformer_config=language_transformer_config,
    language_transformer_layer_spec=language_model_layer_spec,
    parallel_output=True,
    pre_process=True,
    post_process=True,
    add_encoder=True,
    add_decoder=False,
    pg_collection=pg_collection,
)
```

Also applies to: 366–375
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (2)

272-288: ⚠️ Potential issue | 🟡 Minor

Switch to `T | None` in the forward signature.

The forward parameters use implicit `Optional` defaults (e.g., `position_ids: torch.Tensor = None`). Use `torch.Tensor | None` to comply with typing guidelines for Python 3.10+.
77-112: ⚠️ Potential issue | 🔴 Critical

Require `pg_collection` or provide a default in the constructor.

`pg_collection` defaults to `None` (line 77) but is immediately accessed at lines 102–104 (`.cp`, `.tp`, `.pp`) and line 112 (`.embd`) without null checks. This will crash with `AttributeError: 'NoneType' object has no attribute ...` if `None` is passed or the default is relied upon. While line 166 later reassigns from `get_pg_collection(self)`, the constructor fails before reaching that line.

Either make the parameter required or initialize a default from `ProcessGroupCollection.use_mpu_process_groups()` before using it.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (1)

54-125: ⚠️ Potential issue | 🟠 Major

Guard `cp_group` before calling `.size()`.

The `cp_group` parameter defaults to `None`, but `forward()` unconditionally calls `self.cp_group.size()` at line 122. While current usage always provides a valid process group via `pg_collection.cp`, the parameter signature permits `None`. Add a guard or make `cp_group` required to prevent runtime errors.

Suggested fix:

```diff
-        self.cp_group = cp_group
+        if cp_group is None:
+            raise ValueError("cp_group must be provided for Qwen3VLMultimodalRotaryEmbedding")
+        self.cp_group = cp_group
 ...
-        if self.cp_group.size() > 1 and not self.is_thd_format:
+        if self.cp_group is not None and self.cp_group.size() > 1 and not self.is_thd_format:
```
🤖 Fix all issues with AI agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py`:
- Around line 16-42: The file uses a wildcard import from
megatron.core.transformer.attention and old typing forms; replace "from
megatron.core.transformer.attention import *" with explicit imports:
SelfAttention, BaseInferenceContext, PackedSeqParams,
deprecate_inference_params, nvtx_range_push, nvtx_range_pop, is_fa_min_version,
HAVE_FA3, and import Tensor from torch; then update the
Qwen3VLSelfAttention.forward signature and any related docstrings to modern PEP
604 and built-in generics syntax (e.g., Optional[Tensor] -> Tensor | None,
Union[Tensor, Tuple[Tensor, Tensor]] -> Tensor | tuple[Tensor, Tensor],
Tuple[Tensor, Tensor] -> tuple[Tensor, Tensor]) so symbols like forward,
SelfAttention, BaseInferenceContext, and PackedSeqParams use the new
annotations.
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py`:
- Line 60: The parameter pg_collection currently defaults to None but the code
later accesses pg_collection.cp unguarded, which will raise AttributeError; fix
by making pg_collection required (remove the default None from the parameter
declaration for pg_collection: ProcessGroupCollection) so callers must pass a
valid ProcessGroupCollection, or alternatively add a guard before any access to
pg_collection.cp (e.g., check if pg_collection is not None and handle the None
case or raise a clear ValueError), and update all usages in the
constructor/function where pg_collection is referenced (search for pg_collection
and pg_collection.cp around the constructor or methods at the top of
text_model.py) to ensure all accesses are protected or callers are updated.
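For the `pg_collection` findings above (here, and the parallel ones in model.py and transformer_block.py), the guard-or-default pattern can be sketched with stand-ins. `FakeProcessGroupCollection` below is hypothetical: it only mimics the attribute surface (`.cp`, `.tp`, `.pp`, `.embd`) the review says the constructors touch, standing in for megatron.core's real `ProcessGroupCollection`:

```python
class FakeProcessGroupCollection:
    """Hypothetical stand-in for megatron.core's ProcessGroupCollection."""

    def __init__(self):
        self.cp = "cp_group"
        self.tp = "tp_group"
        self.pp = "pp_group"
        self.embd = "embd_group"

    @classmethod
    def use_mpu_process_groups(cls):
        # Stand-in for the default-construction path named in the review.
        return cls()


class Model:
    def __init__(self, pg_collection=None):
        # Guard before dereferencing: fall back to a default collection
        # instead of crashing with AttributeError when None is passed.
        if pg_collection is None:
            pg_collection = FakeProcessGroupCollection.use_mpu_process_groups()
        self.cp_group = pg_collection.cp
        self.tp_group = pg_collection.tp
        self.pp_group = pg_collection.pp


m = Model()  # no AttributeError: the default collection is built lazily
assert m.cp_group == "cp_group"
```

The alternative fix — removing the `= None` default so callers must pass a collection — trades this fallback for a loud `TypeError` at construction time.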
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`:
- Around line 429-432: The loop uses the undefined symbol
sharded_state_dict_default and an incorrect identity comparison syntax; import
sharded_state_dict_default at the top of transformer_block.py (where other
sharded helpers are imported) and change the conditional from "if (not module is
self.layers) and (not module is self.deepstack_merger_list):" to use proper
identity checks "if (module is not self.layers) and (module is not
self.deepstack_merger_list):" so sharded_state_dict_default(module, ...)
resolves and lint errors E714/F821 are fixed.
- Around line 56-83: The constructor Qwen3VLVisionTransformerBlock.__init__
dereferences pg_collection into cp_group/tp_group/pp_group without null
checking; update the init to validate and handle a missing pg_collection by
either requiring it (remove the default None) or initializing it when None
(e.g., call ProcessGroupCollection.use_mpu_process_groups() or another factory)
and then assign self.cp_group/self.tp_group/self.pp_group, or raise a clear
ValueError if pg_collection is None; apply the same null-check/fix pattern to
Qwen3VLVisionModel.__init__ to prevent AttributeError.
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py`:
- Around line 137-149: The tp_comm_buffer_name for the second MLP linear layer
is wrong: in the build_module call that constructs self.linear_fc2 you should
change tp_comm_buffer_name from "patch_fc1" to a distinct name (e.g.
"patch_fc2") to match the corresponding first layer (self.linear_fc1) and avoid
comm buffer collisions; update the build_module invocation for linear_fc2 so
tp_comm_buffer_name="patch_fc2".
- Around line 365-384: In split_data_cp_rank, the early None guard is
unreachable because val.shape is accessed before checking val is None; move the
"if val is None: return val" to the top of the function (before any assertions
or val.shape accesses), then perform the existing assertions (assert cp_size > 1
and the seq length divisibility) and the rest of the logic (view, index_select,
reshape) unchanged; ensure you still default cp_rank via
mpu.get_context_parallel_rank() after the None check so behavior of
split_data_cp_rank remains identical for non-None tensors.
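The guard-ordering fix for `split_data_cp_rank` can be illustrated with a pure-Python stand-in. The real function operates on tensors with `view`/`index_select` and defaults `cp_rank` from `mpu.get_context_parallel_rank()`; the 2×cp_size chunk pairing shown here is the usual Megatron-style context-parallel load balancing, assumed for illustration rather than taken from the diff:

```python
def split_for_cp_rank(val, cp_size, cp_rank):
    # The None guard must come first, before any attribute access on val.
    if val is None:
        return None
    assert cp_size > 1
    seq_len = len(val)
    assert seq_len % (2 * cp_size) == 0, "sequence not divisible by 2*cp_size"
    chunk = seq_len // (2 * cp_size)
    chunks = [val[i * chunk:(i + 1) * chunk] for i in range(2 * cp_size)]
    # Rank r keeps chunks r and (2*cp_size - 1 - r) to balance causal load.
    return chunks[cp_rank] + chunks[2 * cp_size - 1 - cp_rank]

assert split_for_cp_rank(None, 2, 0) is None            # guard fires first
assert split_for_cp_rank(list(range(8)), 2, 0) == [0, 1, 6, 7]
assert split_for_cp_rank(list(range(8)), 2, 1) == [2, 3, 4, 5]
```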
- Line 493: The decorator is written as `@torch.no_grad` without parentheses,
which applies the `no_grad` class itself rather than an instance of the context
manager; update the decorator in modelling_qwen3_vl/utils.py to
`@torch.no_grad()` so the no-grad context is applied explicitly (the decorated
function is directly below the `@torch.no_grad` line).
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py`:
- Around line 35-51: The constructor __init__ of Qwen3VLVisionModel dereferences
pg_collection.tp without checking pg_collection (pg_collection defaults to
None), so either require pg_collection be passed or validate it: update __init__
to assert or raise a clear ValueError if pg_collection is None (e.g.,
"pg_collection must be provided"), and only set self.tp_group =
self.pg_collection.tp after that check (or remove the default None from the
pg_collection parameter to make it required); ensure the validation and
assignment occur before any use of self.tp_group.
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py`:
- Around line 195-211: The MoE bridge still uses ReplicatedMapping for the
vision module while the dense bridge switched to ConcatenatedQKVMapping for
vision QKV; update the MoE bridge to mirror those same mappings (add
ConcatenatedQKVMapping entries for
vision_model.decoder.layers.*.self_attention.linear_qkv.weight and .bias mapping
to model.visual.blocks.*.attn.qkv.weight and .bias, and keep ReplicatedMapping
for vision_model.patch_embed.proj.** and vision_model.pos_embed.**) or refactor
both bridges to import/share a common vision mapping block so the MoE conversion
uses the same QKV handling as the dense bridge.
In `@src/megatron/bridge/training/utils/packed_seq_utils.py`:
- Around line 86-92: The code computes seqlens_in_batch when attention_mask is
None but leaves attention_mask as None, which causes a TypeError later when
pre_process=True and the code indexes attention_mask[i]; to fix, when
attention_mask is None initialize a boolean mask (e.g., attention_mask =
torch.ones((batch_size, input_ids.shape[1]), dtype=torch.bool,
device=input_ids.device)) and keep seqlens_in_batch consistent (torch.int32
sums) so subsequent uses like attention_mask[i] and boolean indexing in the
pre-processing block work correctly; update the branch that creates
seqlens_in_batch (and the analogous 130–145 region) to materialize this mask
instead of leaving attention_mask as None.
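The mask-materialization fix can be sketched in miniature. The real code would build `torch.ones((batch, seq), dtype=torch.bool, device=input_ids.device)`; this stand-in uses plain lists only to show why the mask must exist before the per-sample `attention_mask[i]` indexing:

```python
def materialize_mask(input_ids, attention_mask=None):
    # If no mask is given, build an all-True mask so later per-sample
    # indexing (attention_mask[i]) works instead of raising TypeError.
    batch, seq = len(input_ids), len(input_ids[0])
    if attention_mask is None:
        attention_mask = [[True] * seq for _ in range(batch)]
    # Per-sample valid-token counts, consistent with the masked path.
    seqlens_in_batch = [sum(row) for row in attention_mask]
    return attention_mask, seqlens_in_batch

mask, seqlens = materialize_mask([[1, 2, 3], [4, 5, 6]])
assert seqlens == [3, 3]
assert mask[0] == [True, True, True]   # indexing no longer fails on None
```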
In `@src/megatron/bridge/training/vlm_step.py`:
- Around line 124-125: The code currently hard-sets is_first = True and is_last
= True and contains an unconditional "if True:" that makes the else branch
unreachable; update the VLM step so it either (a) restores correct per-rank
is_first/is_last logic (remove the hard-coded assignments and compute them from
pipeline parallel rank/size) and remove the "if True:" so the else branch is
used, or (b) if the hard-coding was intentional for CUDA graph support, leave
the behavior but remove the dead else branch and add a clear comment above the
assignments explaining: why all PP ranks must load labels/loss_mask (reference
get_batch_from_iterator where labels/loss_mask are loaded), why this bypasses
the encoder for CUDA graph, and which specific condition (e.g.,
CUDA_GRAPH_ENABLED) justifies the behavior; update the conditional around the
former "if True:" to a meaningful check or delete it and keep only the active
path so there is no unreachable code.
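Option (a) above — restoring per-rank flags — reduces to deriving both booleans from the pipeline rank. In Megatron these would come from the mpu pipeline-parallel rank/world-size helpers; they are abstracted here as plain arguments:

```python
def pp_stage_flags(pp_rank: int, pp_size: int) -> tuple[bool, bool]:
    # Derive first/last stage membership from the pipeline rank rather
    # than hard-coding both flags to True on every rank.
    return pp_rank == 0, pp_rank == pp_size - 1

assert pp_stage_flags(0, 4) == (True, False)   # first stage only
assert pp_stage_flags(3, 4) == (False, True)   # last stage only
assert pp_stage_flags(0, 1) == (True, True)    # no PP: both flags hold
```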
🟡 Minor comments (22)
src/megatron/bridge/models/qwen_vl/qwen3_vl_provider.py (1)

39-43: ⚠️ Potential issue | 🟡 Minor

Remove unused import `get_vision_model_config`.

Static analysis confirms `get_vision_model_config` (line 39) is imported but not used in this file. If it's needed by external consumers re-exporting from this module, add it to `__all__`; otherwise remove it.

Proposed fix (if not re-exported):

```diff
-from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.transformer_config import get_vision_model_config
 from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.utils import PatchMergerSubmodules
```

examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py (1)

40-56: ⚠️ Potential issue | 🟡 Minor

Misleading comment: says "4B" but uses the 30B A3B config.

Line 42 says "Get the standard Qwen3 4B pretrain config" but calls `qwen3_vl_30b_a3b_pretrain_config`, which is for the 30B-A3B model.

Proposed fix:

```diff
 def main() -> None:
     """Run Qwen3 pretraining with decentralized process groups enabled."""
-    # Get the standard Qwen3 4B pretrain config with overrides
+    # Get the standard Qwen3 VL 30B A3B pretrain config with overrides
     cfg = qwen3_vl_30b_a3b_pretrain_config(
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (1)
15-21: ⚠️ Potential issue | 🟡 Minor

Remove the unused `torch` import.

Static analysis flags `torch` (line 20) as imported but unused. The file uses `torch.nn.functional as F` but not `torch` directly.

Proposed fix:

```diff
 from copy import deepcopy
 from dataclasses import dataclass, field
 from functools import partial
 from typing import List, Optional

-import torch
 import torch.nn.functional as F

 from megatron.core.transformer.transformer_config import TransformerConfig
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py (1)

242: ⚠️ Potential issue | 🟡 Minor

Typo: `"flaaten"` → `"flattened"`.

```diff
-    assert input_ids.dim() == 1, "input_ids should be flaaten"
+    assert input_ids.dim() == 1, "input_ids should be flattened"
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py (1)

348-349: ⚠️ Potential issue | 🟡 Minor

Replace `assert False` with `raise ValueError(...)` — assertions are removed under `python -O`.

As the static analysis tool notes, `assert False` can be silently stripped. Use an explicit exception for unreachable/invalid branches.

Fix:

```diff
 else:
-    assert False, f"should not have {token_id=}"
+    raise ValueError(f"Unexpected token_id={token_id}")
```

examples/recipes/qwen_vl/finetune_qwen_vl.py (1)
106-108: ⚠️ Potential issue | 🟡 Minor

Remove the unused `partial` import and fix import ordering.

Static analysis flags `functools.partial` as unused (F401). Additionally, the stdlib import on line 108 should be grouped with the other stdlib imports (lines 89–93) per the import ordering guidelines.

Fix:

```diff
 from megatron.bridge.training.qwen3vl_step import forward_step
 from megatron.bridge.utils.common_utils import get_rank_safe
-from functools import partial
```

As per coding guidelines: "Organize imports in order: future imports, standard library, third-party, first-party, local folder imports, separated by blank lines".

src/megatron/bridge/training/qwen3vl_step.py (1)

38: ⚠️ Potential issue | 🟡 Minor

Rename the module-level logger to `G_LOGGER`.

Global variables must use upper snake_case and be prefixed with `G`.

Suggested fix:

```diff
-logger = logging.getLogger(__name__)
+G_LOGGER = logging.getLogger(__name__)
```

As per coding guidelines: Use upper snake_case and prefix 'G' for global variables (e.g., G_MY_GLOBAL).
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py (1)

64-69: ⚠️ Potential issue | 🟡 Minor

Remove or justify the commented‑out code.
The commented parameter line provides no explanation. Either remove it or add a comment explaining why it is intentionally disabled.
As per coding guidelines: If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (1)

84-93: ⚠️ Potential issue | 🟡 Minor

Add an explicit error message for the unsupported path.

`NotImplementedError()` is raised without context, which makes failures opaque. Include a descriptive message.

Suggested fix:

```diff
-    else:
-        raise NotImplementedError()
+    else:
+        raise NotImplementedError("Qwen3VLVisionModel only supports pre_process=True.")
```

Based on learnings: In Python implementations, when a feature is not supported, raise an explicit error with a descriptive message.
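On the related `assert False` finding in utils.py above: the stripping behavior under `python -O` can be demonstrated directly with `compile(..., optimize=1)`, which applies the same optimization as the `-O` flag:

```python
src = (
    "def f(token_id):\n"
    "    assert False, f'should not have {token_id=}'\n"
    "    return None\n"
)

ns_debug, ns_opt = {}, {}
exec(compile(src, "<demo>", "exec", optimize=0), ns_debug)  # normal mode
exec(compile(src, "<demo>", "exec", optimize=1), ns_opt)    # mimics python -O

try:
    ns_debug["f"](7)
    raised = False
except AssertionError:
    raised = True
assert raised                    # normal mode: the assert fires
assert ns_opt["f"](7) is None    # -O mode: the assert is silently removed
```

A `raise ValueError(...)`, by contrast, survives both modes, which is why the review asks for it on unreachable branches.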
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)

185-188: ⚠️ Potential issue | 🟡 Minor

Remove or justify the commented‑out code.

This commented block lacks a rationale; remove it or explain why it must stay.

As per coding guidelines: If code is commented out, include a comment describing its usage and why it is commented out; otherwise remove it.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)

136-149: ⚠️ Potential issue | 🟡 Minor

Use `print_rank_0` for rank‑safe logging.

Direct `print()` will spam logs across ranks. Please use `print_rank_0` (or a rank‑aware logger).

As per coding guidelines: Use 'print_rank_0' for logging in model bridge to avoid duplicate output across ranks.
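A minimal sketch of the rank-guarded pattern; the real `print_rank_0` in the bridge utilities presumably queries the distributed rank, which this hypothetical version approximates via the `RANK` environment variable:

```python
import io
import os
from contextlib import redirect_stdout

def print_rank_0(msg, rank=None):
    # Emit only on rank 0 so multi-process runs don't duplicate output.
    # Falling back to the RANK env var is an assumption for this sketch;
    # real code would ask torch.distributed for the rank.
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        print(msg)

buf = io.StringIO()
with redirect_stdout(buf):
    print_rank_0("hello", rank=0)
    print_rank_0("hello", rank=3)   # suppressed on non-zero ranks
assert buf.getvalue() == "hello\n"
```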
src/megatron/bridge/training/qwen3vl_step.py (1)

224: ⚠️ Potential issue | 🟡 Minor

Remove the unused `max_seqlen_in_batch`.

`max_seqlen_in_batch` is assigned but never used, which fails lint.

Suggested fix:

```diff
-    max_seqlen_in_batch = seqlens_in_batch.max().item()
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py (1)
126-127: ⚠️ Potential issue | 🟡 Minor

Remove the unnecessary f-string prefix. The string contains no placeholders, so the `f` is unnecessary.

Suggested fix:

```diff
-    raise ValueError(f"CUDA graphs must use flash decode with static batching!")
+    raise ValueError("CUDA graphs must use flash decode with static batching!")
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py (1)
363-365: ⚠️ Potential issue | 🟡 Minor

Use `dict | None` for nullable metadata.

Change `metadata: dict = None` to `metadata: dict | None = None` to conform with Python 3.10+ type hints per coding guidelines.

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (1)
197-203: ⚠️ Potential issue | 🟡 Minor

Update type annotations to use `T | None` syntax instead of `Optional[T]`.

The `forward` method has three parameter annotations that violate the coding guideline:

- `hidden_states: Optional[torch.Tensor]` should be `torch.Tensor | None`
- `inference_params: Optional[InferenceParams] = None` should be `InferenceParams | None`
- `extra_block_kwargs: dict = None` should be `dict | None`

Per coding guidelines: Use 'T | None' for nullable types instead of 'Optional[T]'.
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (1)

1-15: ⚠️ Potential issue | 🟡 Minor

Remove unused imports to satisfy lint.

Static analysis correctly flags multiple unused imports: `dataclass`, `Union`, `MultimodalProjector`, `MegatronModule`, `build_module`, and `get_tensor_model_parallel_group_if_none`. These should be removed while keeping `Optional` and `ModuleSpec`, which are actively used.

Suggested cleanup:

```diff
-from dataclasses import dataclass
-from typing import Optional, Union
+from typing import Optional
 ...
-from megatron.core.models.vision.multimodal_projector import MultimodalProjector
 ...
-from megatron.core.transformer.module import MegatronModule
-from megatron.core.transformer.spec_utils import ModuleSpec, build_module
-from megatron.core.utils import get_tensor_model_parallel_group_if_none
+from megatron.core.transformer.spec_utils import ModuleSpec
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)
15-19: ⚠️ Potential issue | 🟡 Minor

Remove the unused `mpu` import.

`mpu` is imported on line 17 but never used in the file.

Suggested fix:

```diff
-from megatron.core import InferenceParams, mpu, tensor_parallel
+from megatron.core import InferenceParams, tensor_parallel
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (1)
265-306: ⚠️ Potential issue | 🟡 Minor

Add a return type annotation and remove the unused parameter.

`apply_rotary_pos_emb_absolute` is missing a return type annotation (line 282). Additionally, the `cu_seqlens` parameter of `apply_rotary_pos_emb_thd_absolute` (line 265) is unused in the function body and should be removed, along with its reference at the call site (line 302).

Suggested fix:

```diff
 def apply_rotary_pos_emb_thd_absolute(
-    t: Tensor, cu_seqlens: Tensor, freqs: Tensor, rotary_interleaved: bool = False
+    t: Tensor, freqs: Tensor, rotary_interleaved: bool = False
 ) -> Tensor:
     ...
     return _apply_rotary_pos_emb_bshd(t[:, None], freqs, rotary_interleaved=rotary_interleaved).squeeze(1)

 def apply_rotary_pos_emb_absolute(
     t: Tensor,
     freqs: Tensor,
     config: Qwen3VLTransformerConfig,
     cu_seqlens: Optional[Tensor] = None,
-):
+) -> Tensor:
     ...
     if cu_seqlens is None:
         result = _apply_rotary_pos_emb_bshd(t, freqs, rotary_interleaved=config.rotary_interleaved)
     else:
-        result = apply_rotary_pos_emb_thd_absolute(t, cu_seqlens, freqs, rotary_interleaved=config.rotary_interleaved)
+        result = apply_rotary_pos_emb_thd_absolute(t, freqs, rotary_interleaved=config.rotary_interleaved)
     ...
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (1)
177-185: ⚠️ Potential issue | 🟡 Minor

Rename the loop variable to avoid shadowing the `input_ids` parameter.

The loop variable at line 177 shadows the outer `input_ids` parameter, creating a subtle bug. After the loop exits (line 241), `mrope_position_deltas = torch.tensor(mrope_position_deltas, device=input_ids.device)` uses the sliced tensor from the last iteration rather than the original parameter. Rename it to `sample_input_ids` and update all references within the loop.

Suggested fix:

```diff
-    for i, input_ids in enumerate(total_input_ids):
-        input_ids = input_ids[attention_mask[i] == 1]
+    for i, sample_input_ids in enumerate(total_input_ids):
+        sample_input_ids = sample_input_ids[attention_mask[i] == 1]
         image_nums, video_nums = 0, 0
-        vision_start_indices = torch.argwhere(input_ids == vision_start_token_id).squeeze(1)
-        vision_tokens = input_ids[vision_start_indices + 1]
+        vision_start_indices = torch.argwhere(sample_input_ids == vision_start_token_id).squeeze(1)
+        vision_tokens = sample_input_ids[vision_start_indices + 1]
         image_nums = (vision_tokens == image_token_id).sum()
         video_nums = (vision_tokens == video_token_id).sum()
-        input_tokens = input_ids.tolist()
+        input_tokens = sample_input_ids.tolist()
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/rope.py (1)
16-55: ⚠️ Potential issue | 🟡 Minor

Use PEP 604 and built‑in generics for type hints.

The code uses `List` and `Optional` from the `typing` module, which conflicts with Python 3.10+ conventions. Update to `list[...]` for generic sequences and `T | None` for nullable types, then remove the unused typing imports.

Suggested update:

```diff
-from typing import List, Optional
+from typing import TYPE_CHECKING
 ...
-    seq_len_interpolation_factor: Optional[float] = None,
+    seq_len_interpolation_factor: float | None = None,
 ...
-    mrope_section: List[int],
+    mrope_section: list[int],
 ...
-    input_ids: Optional[torch.LongTensor] = None,
+    input_ids: torch.LongTensor | None = None,
```

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py (1)
140-187: ⚠️ Potential issue | 🟡 Minor

Add `strict=True` to all `zip()` calls and replace the unused `t` variable with `_`.

Ruff (B905) flags all three `zip()` calls for missing an explicit `strict=` parameter. Additionally, the `t` variable at line 140 is unused within the loop body and should be replaced with `_` (B007).

Suggested fix:

```diff
-    for t, h, w in zip(grid_ts, grid_hs, grid_ws):
+    for _, h, w in zip(grid_ts, grid_hs, grid_ws, strict=True):
 ...
-    patch_pos_embeds = patch_pos_embeds.split([h * w for h, w in zip(grid_hs, grid_ws)])
+    patch_pos_embeds = patch_pos_embeds.split([h * w for h, w in zip(grid_hs, grid_ws, strict=True)])
 ...
-    for pos_embed, t, h, w in zip(patch_pos_embeds, grid_ts, grid_hs, grid_ws):
+    for pos_embed, t, h, w in zip(patch_pos_embeds, grid_ts, grid_hs, grid_ws, strict=True):
```

src/megatron/bridge/training/qwen3vl_step.py (1)
41-48: ⚠️ Potential issue | 🟡 Minor

Prefix unused parameters with `_` to document intentional non-use.

The parameters `use_mtp` and `is_first_pp_stage` are not used in this function's implementation. Ruff correctly flags them as unused. Prefix them with `_` to comply with the coding guidelines.

Suggested fix:

```diff
 def get_batch_from_iterator(
     data_iterator: Iterable,
-    use_mtp: bool = False,
+    _use_mtp: bool = False,
     skip_getting_attention_mask_from_dataset: bool = True,
     *,
-    is_first_pp_stage: bool,
+    _is_first_pp_stage: bool,
     is_last_pp_stage: bool,
 ) -> dict[str, Any]:
```
🧹 Nitpick comments (6)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_config.py (1)

57-66: Commented-out code should include an explanation or be removed.

The commented-out block (lines 58–66) has a partial note about avoiding a deepcopy of `parallel_state`, but doesn't explain why this approach was abandoned. As per coding guidelines, commented-out code should either include a clear reason for being kept or be removed before merging.

Also, the function is missing a docstring and type annotations for `hf_config` and the return type.

Proposed improvement:

```diff
-def get_vision_model_config(config: Qwen3VLTransformerConfig, hf_config):
-    # # init config from scratch to avoid deepcopy of parallel_state
-    # config = Qwen3VLTransformerConfig(
-    #     num_layers=hf_config.depth,
-    #     hidden_size=hf_config.hidden_size,
-    #     num_attention_heads=hf_config.num_heads,
-    #     ffn_hidden_size=hf_config.intermediate_size,
-    #     add_bias_linear=True,
-    #     add_qkv_bias=True,
-    # )
+def get_vision_model_config(
+    config: Qwen3VLTransformerConfig, hf_config: "Qwen3VLVisionConfig"
+) -> Qwen3VLTransformerConfig:
+    """Configure a Qwen3VL transformer config for the vision encoder.
+
+    Mutates ``config`` in place to override MoE, parallelism, dropout,
+    and activation settings appropriate for the vision backbone.
+
+    Args:
+        config: The base transformer config to mutate.
+        hf_config: HuggingFace vision config providing patch sizes,
+            embeddings, and layer structure.
+
+    Returns:
+        The mutated ``config``.
+    """
     config.num_moe_experts = None
```

src/megatron/bridge/training/utils/packed_seq_utils.py (3)
15-19: Import ordering: `math` (stdlib) should come before `torch` (third-party).

Per coding guidelines, imports should be organized: future imports, standard library, third-party, first-party, separated by blank lines.

Proposed fix:

```diff
 from __future__ import annotations

+import math
+
 import torch
-import math

 from megatron.core.packed_seq_params import PackedSeqParams
```

73-74: Move `from megatron.core import mpu` to the top-level imports.

Module-level imports placed mid-file are harder to discover and violate standard import organization conventions. The "Copied from" comment is fine to keep as attribution.

Proposed fix — move to the import section at the top of the file:

```diff
+from megatron.core import mpu
 from megatron.core.packed_seq_params import PackedSeqParams
```

And remove lines 73-74 from the middle of the file.
77-171: Duplicate code with `src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py`.

`preprocess_packed_seqs` and `postprocess_packed_seqs` are nearly identical to the versions in `utils.py` (per relevant snippets). The only differences are the `use_fp8_padding` parameter and the `attention_mask is None` handling. Having two copies violates DRY and increases the risk of divergence over time.

Consider consolidating into a single shared implementation. If that's not feasible now, at minimum add a comment cross-referencing the other copy.

Also, `use_fp8_padding` on line 78 is missing a type annotation (`bool`).

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py (1)

16-17: Use `X | Y` and `T | None` instead of `Optional`/`Union` per coding guidelines.

`Optional` is used on line 103 and `Union` on lines 92–94. The coding guidelines require Python 3.10+ union syntax.

Example fix for PatchMergerSubmodules:

```diff
-from typing import Optional, Union
+from __future__ import annotations
 ...
 @dataclass
 class PatchMergerSubmodules:
-    patch_norm: Union[ModuleSpec, type] = None
-    linear_fc1: Union[ModuleSpec, type] = None
-    linear_fc2: Union[ModuleSpec, type] = None
+    patch_norm: ModuleSpec | type | None = None
+    linear_fc1: ModuleSpec | type | None = None
+    linear_fc2: ModuleSpec | type | None = None
```

As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'" and "Use 'X | Y' for union types instead of 'Union[X, Y]'".

src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py (1)
21-21: UseT | Nonesyntax per coding guidelines.
Optional(line 21) and the implicitOptionalon line 60 should use Python 3.10+ union syntax.-from typing import Literal, Optional +from __future__ import annotations +from typing import Literal ... - pg_collection: ProcessGroupCollection = None, + pg_collection: ProcessGroupCollection | None = None,As per coding guidelines: "Use 'T | None' for nullable types instead of 'Optional[T]'".
Also applies to: 60-60
Quoted hunk from `src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py`:

```python
from megatron.core.transformer.attention import *

from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.rope import apply_rotary_pos_emb_absolute


class Qwen3VLSelfAttention(SelfAttention):
    """
    Overrides the SelfAttention class, the difference is that qwen3vl uses apply_rotary_pos_emb_absolute
    instead of apply_rotary_pos_emb
    """

    def forward(
        self,
        hidden_states: Tensor,
        attention_mask: Tensor,
        key_value_states: Optional[Tensor] = None,
        inference_context: Optional[BaseInferenceContext] = None,
        rotary_pos_emb: Optional[Union[Tensor, Tuple[Tensor, Tensor]]] = None,
        rotary_pos_cos: Optional[Tensor] = None,
        rotary_pos_sin: Optional[Tensor] = None,
        attention_bias: Optional[Tensor] = None,
        packed_seq_params: Optional[PackedSeqParams] = None,
        sequence_len_offset: Optional[int] = None,
        *,
        inference_params: Optional[BaseInferenceContext] = None,
        rotary_pos_cos_sin: Optional[Tensor] = None,
    ) -> Tuple[Tensor, Tensor]:
```
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:

# Check if the file exists and inspect its current state
find . -path "*qwen_vl/modelling_qwen3_vl/attention.py" -type f

🏁 Script executed:

# Show the file contents around the specified lines
find . -path "*qwen_vl/modelling_qwen3_vl/attention.py" -type f -exec wc -l {} \;

🏁 Script executed:

# Read the beginning of the file to see imports and the forward method
find . -path "*qwen_vl/modelling_qwen3_vl/attention.py" -type f -exec head -100 {} \;

🏁 Script executed:

# Find the megatron.core.transformer.attention module
find . -path "*/megatron/core/transformer/attention.py" -type f | head -1

🏁 Script executed:

# Read the megatron.core.transformer.attention module to see what's exported
find . -path "*/megatron/core/transformer/attention.py" -type f -exec head -150 {} \;

🏁 Script executed:

# Check if there's an __all__ defined in that module
find . -path "*/megatron/core/transformer/attention.py" -type f -exec grep -n "__all__" {} +

🏁 Script executed:

# Scan the full attention.py file to see all usages from the wildcard import
cat ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py | grep -E "^\s*(class|def|[A-Z][A-Za-z0-9_]*)" | head -50

🏁 Script executed:

# Search more carefully for the megatron core module
find . -type f -path "*/megatron/core/transformer/attention.py" 2>/dev/null

🏁 Script executed:

# List all files in megatron/core/transformer directory
find . -type d -name "megatron" 2>/dev/null | head -5

🏁 Script executed:

# Try searching in different ways
find . -type f -name "attention.py" | grep megatron | grep core

🏁 Script executed:

# Let's directly search the entire file for symbols that would come from the wildcard import
cat ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py | grep -oE "(SelfAttention|BaseInferenceContext|PackedSeqParams|deprecate_inference_params|nvtx_range_push|nvtx_range_pop|is_fa_min_version|HAVE_FA3)" | sort | uniq

🏁 Script executed:

# Get a comprehensive view of the entire file to see all references
cat ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py

🏁 Script executed:

# Verify the typing imports are being used correctly - check what Tensor is
rg "Tensor" ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py | head -20

🏁 Script executed:

# Verify Python version requirements and check if tuple vs Tuple is required
head -5 ./pyproject.toml 2>/dev/null || head -5 ./setup.py 2>/dev/null || grep -r "python_requires" . --include="*.py" | head -1

🏁 Script executed:

# Verify Tensor source - check if it's torch.Tensor
rg "from torch" ./src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/ || echo "No explicit torch import found"

🏁 Script executed:

# Check the coding guidelines to confirm tuple vs Tuple requirement
grep -A5 -B5 "built-in generics" ./CODING_GUIDELINES.md

🏁 Script executed:

# Verify the Python version requirement
cat ./pyproject.toml | grep -A10 "python"
Replace the wildcard import and update type annotations to use PEP 604 syntax.
The import * from megatron.core.transformer.attention violates linting rules (F403/F405), hides symbol sources, and makes the module fragile to upstream changes. Additionally, the function signature uses deprecated typing syntax (Optional, Union, Tuple) that conflicts with the project's Python 3.10+ requirement and coding guidelines, which mandate T | None, X | Y, and built-in generics (tuple, list, dict) respectively.
Replace the wildcard import with explicit symbols (SelfAttention, BaseInferenceContext, PackedSeqParams, deprecate_inference_params, nvtx_range_push, nvtx_range_pop, is_fa_min_version, HAVE_FA3) and import Tensor from torch. Update all type annotations in the forward method signature and docstring to use modern syntax:
- `Optional[Tensor]` → `Tensor | None`
- `Union[Tensor, Tuple[Tensor, Tensor]]` → `Tensor | tuple[Tensor, Tensor]`
- `Tuple[Tensor, Tensor]` → `tuple[Tensor, Tensor]`
Suggested update
-from megatron.core.transformer.attention import *
+from megatron.core.transformer.attention import (
+ SelfAttention,
+ BaseInferenceContext,
+ PackedSeqParams,
+ deprecate_inference_params,
+ nvtx_range_push,
+ nvtx_range_pop,
+ is_fa_min_version,
+ HAVE_FA3,
+)
+from torch import Tensor
from megatron.bridge.models.qwen_vl.modelling_qwen3_vl.rope import apply_rotary_pos_emb_absolute
...
- key_value_states: Optional[Tensor] = None,
+ key_value_states: Tensor | None = None,
inference_context: Optional[BaseInferenceContext] = None,
- rotary_pos_emb: Optional[Union[Tensor, Tuple[Tensor, Tensor]]] = None,
+ rotary_pos_emb: Tensor | tuple[Tensor, Tensor] | None = None,
- rotary_pos_cos: Optional[Tensor] = None,
+ rotary_pos_cos: Tensor | None = None,
- rotary_pos_sin: Optional[Tensor] = None,
+ rotary_pos_sin: Tensor | None = None,
- attention_bias: Optional[Tensor] = None,
+ attention_bias: Tensor | None = None,
- packed_seq_params: Optional[PackedSeqParams] = None,
+ packed_seq_params: PackedSeqParams | None = None,
- sequence_len_offset: Optional[int] = None,
+ sequence_len_offset: int | None = None,
*,
- inference_params: Optional[BaseInferenceContext] = None,
+ inference_params: BaseInferenceContext | None = None,
- rotary_pos_cos_sin: Optional[Tensor] = None,
- ) -> Tuple[Tensor, Tensor]:
+ rotary_pos_cos_sin: Tensor | None = None,
+ ) -> tuple[Tensor, Tensor]:

🧰 Tools
🪛 Ruff (0.15.0)
[error] 16-16: from megatron.core.transformer.attention import * used; unable to detect undefined names
(F403)
[error] 21-21: SelfAttention may be undefined, or defined from star imports
(F405)
[error] 29-29: Tensor may be undefined, or defined from star imports
(F405)
[error] 30-30: Tensor may be undefined, or defined from star imports
(F405)
[error] 31-31: Optional may be undefined, or defined from star imports
(F405)
[error] 31-31: Tensor may be undefined, or defined from star imports
(F405)
[error] 32-32: Optional may be undefined, or defined from star imports
(F405)
[error] 32-32: BaseInferenceContext may be undefined, or defined from star imports
(F405)
[error] 33-33: Optional may be undefined, or defined from star imports
(F405)
[error] 33-33: Union may be undefined, or defined from star imports
(F405)
[error] 33-33: Tensor may be undefined, or defined from star imports
(F405)
[error] 33-33: Tuple may be undefined, or defined from star imports
(F405)
[error] 33-33: Tensor may be undefined, or defined from star imports
(F405)
[error] 33-33: Tensor may be undefined, or defined from star imports
(F405)
[error] 34-34: Optional may be undefined, or defined from star imports
(F405)
[error] 34-34: Tensor may be undefined, or defined from star imports
(F405)
[error] 35-35: Optional may be undefined, or defined from star imports
(F405)
[error] 35-35: Tensor may be undefined, or defined from star imports
(F405)
[error] 36-36: Optional may be undefined, or defined from star imports
(F405)
[error] 36-36: Tensor may be undefined, or defined from star imports
(F405)
[error] 37-37: Optional may be undefined, or defined from star imports
(F405)
[error] 37-37: PackedSeqParams may be undefined, or defined from star imports
(F405)
[error] 38-38: Optional may be undefined, or defined from star imports
(F405)
[error] 40-40: Optional may be undefined, or defined from star imports
(F405)
[error] 40-40: BaseInferenceContext may be undefined, or defined from star imports
(F405)
[warning] 41-41: Unused method argument: rotary_pos_cos_sin
(ARG002)
[error] 41-41: Optional may be undefined, or defined from star imports
(F405)
[error] 41-41: Tensor may be undefined, or defined from star imports
(F405)
[error] 42-42: Tuple may be undefined, or defined from star imports
(F405)
[error] 42-42: Tensor may be undefined, or defined from star imports
(F405)
[error] 42-42: Tensor may be undefined, or defined from star imports
(F405)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py` around
lines 16 - 42, The file uses a wildcard import from
megatron.core.transformer.attention and old typing forms; replace "from
megatron.core.transformer.attention import *" with explicit imports:
SelfAttention, BaseInferenceContext, PackedSeqParams,
deprecate_inference_params, nvtx_range_push, nvtx_range_pop, is_fa_min_version,
HAVE_FA3, and import Tensor from torch; then update the
Qwen3VLSelfAttention.forward signature and any related docstrings to modern PEP
604 and built-in generics syntax (e.g., Optional[Tensor] -> Tensor | None,
Union[Tensor, Tuple[Tensor, Tensor]] -> Tensor | tuple[Tensor, Tensor],
Tuple[Tensor, Tensor] -> tuple[Tensor, Tensor]) so symbols like forward,
SelfAttention, BaseInferenceContext, and PackedSeqParams use the new
annotations.
seq_len_interpolation_factor: Optional[float] = None,
mtp_block_spec: Optional[ModuleSpec] = None,
vp_stage: Optional[int] = None,
pg_collection: ProcessGroupCollection = None,
pg_collection.cp will crash with AttributeError when pg_collection is None.
Line 60 defaults pg_collection to None, but line 91 unconditionally accesses pg_collection.cp. Any caller that omits pg_collection will hit AttributeError: 'NoneType' object has no attribute 'cp'.
Either make pg_collection a required parameter (remove the default) or add a guard before accessing .cp.
Option A: Make it required (preferred if always needed)
- pg_collection: ProcessGroupCollection = None,
+ pg_collection: ProcessGroupCollection,

Option B: Guard the access

  self.rotary_pos_emb = Qwen3VLMultimodalRotaryEmbedding(
      kv_channels=self.config.kv_channels,
      rotary_percent=rotary_percent,
      rotary_interleaved=self.config.rotary_interleaved,
      seq_len_interpolation_factor=seq_len_interpolation_factor,
      rotary_base=rotary_base,
-     cp_group=pg_collection.cp,
+     cp_group=pg_collection.cp if pg_collection is not None else None,
  )

Also applies to: 84-92
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py` at line
60, The parameter pg_collection currently defaults to None but the code later
accesses pg_collection.cp unguarded, which will raise AttributeError; fix by
making pg_collection required (remove the default None from the parameter
declaration for pg_collection: ProcessGroupCollection) so callers must pass a
valid ProcessGroupCollection, or alternatively add a guard before any access to
pg_collection.cp (e.g., check if pg_collection is not None and handle the None
case or raise a clear ValueError), and update all usages in the
constructor/function where pg_collection is referenced (search for pg_collection
and pg_collection.cp around the constructor or methods at the top of
text_model.py) to ensure all accesses are protected or callers are updated.
class Qwen3VLVisionTransformerBlock(TransformerBlock):
    def __init__(
        self,
        config: Qwen3VLTransformerConfig,
        spec: Union[TransformerBlockSubmodules, ModuleSpec],
        post_layer_norm: bool = True,
        pre_process: bool = True,
        post_process: bool = True,
        # model_comm_pgs: ModelCommProcessGroups = None,
        vp_stage: Optional[int] = None,
        patch_merger_spec: ModuleSpec = None,
        pg_collection: ProcessGroupCollection = None,
    ):
        assert post_process and pre_process, "not support pp for deepstack_merger_list"
        super().__init__(
            config=config,
            spec=spec,
            post_layer_norm=post_layer_norm,
            pre_process=pre_process,
            post_process=post_process,
            # model_comm_pgs=model_comm_pgs,
            vp_stage=vp_stage,
            pg_collection=pg_collection,
        )
        self.pg_collection = pg_collection
        self.cp_group = pg_collection.cp
        self.tp_group = pg_collection.tp
        self.pp_group = pg_collection.pp
🧩 Analysis chain
🏁 Script executed:

# First, verify the file exists and read the relevant section
cat -n src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py | head -100

🏁 Script executed:

# Search for usages of Qwen3VLVisionTransformerBlock to see how it's instantiated
rg "Qwen3VLVisionTransformerBlock" --type py -B 2 -A 5

🏁 Script executed:

# Check if TransformerBlock has any validation or pg_collection handling
rg "class TransformerBlock" --type py -A 20

🏁 Script executed:

# Search for how pg_collection is used in similar patterns
rg "pg_collection\s*=\s*None" --type py

🏁 Script executed:

# Get the full instantiation context
rg "self.decoder = Qwen3VLVisionTransformerBlock" --type py -A 10

🏁 Script executed:

# Check TransformerBlock parent class from megatron.core
fd "transformer_block.py" --type f | grep -E "megatron.*core"

🏁 Script executed:

# Search for any validation of pg_collection in the codebase
rg "pg_collection.*is.*None|if pg_collection" --type py -B 2 -A 2

🏁 Script executed:

# Find the Qwen3VLVisionModel class and see how pg_collection is set
rg "class Qwen3VLVisionModel" --type py -A 50 | head -100

🏁 Script executed:

# Check if there's parent class initialization that sets pg_collection
rg "self.pg_collection\s*=" --type py -B 3 -A 1 src/megatron/bridge/models/qwen_vl/

🏁 Script executed:

# Look at the model_provider.py pattern to understand how pg_collection should be handled
cat -n src/megatron/bridge/models/model_provider.py | grep -A 10 "pg_collection is None"
Add validation for pg_collection parameter before dereferencing.
The pg_collection parameter defaults to None (line 67) but is dereferenced immediately at lines 81-83 without null checking (pg_collection.cp, .tp, .pp). This will raise AttributeError at runtime if the parameter is not provided. Either make pg_collection required (remove the default value), initialize it when None (e.g., via ProcessGroupCollection.use_mpu_process_groups()), or add explicit validation with a meaningful error message.
Note: The parent class Qwen3VLVisionModel has the same issue and should be fixed as well.
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`
around lines 56 - 83, The constructor Qwen3VLVisionTransformerBlock.__init__
dereferences pg_collection into cp_group/tp_group/pp_group without null
checking; update the init to validate and handle a missing pg_collection by
either requiring it (remove the default None) or initializing it when None
(e.g., call ProcessGroupCollection.use_mpu_process_groups() or another factory)
and then assign self.cp_group/self.tp_group/self.pp_group, or raise a clear
ValueError if pg_collection is None; apply the same null-check/fix pattern to
Qwen3VLVisionModel.__init__ to prevent AttributeError.
for name, module in self.named_children():
    if (not module is self.layers) and (not module is self.deepstack_merger_list):
        sharded_state_dict.update(
            sharded_state_dict_default(module, f"{prefix}{name}.", sharded_offsets, metadata)
Fix missing import and identity comparison.
sharded_state_dict_default is referenced but not imported (F821), and not module is should be module is not (E714). Both will fail lint/CI.
Suggested fix
-from megatron.core.dist_checkpointing.utils import replace_prefix_for_sharding
+from megatron.core.dist_checkpointing.utils import replace_prefix_for_sharding, sharded_state_dict_default
...
- for name, module in self.named_children():
- if (not module is self.layers) and (not module is self.deepstack_merger_list):
+ for name, module in self.named_children():
+ if module is not self.layers and module is not self.deepstack_merger_list:

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for name, module in self.named_children():
    if module is not self.layers and module is not self.deepstack_merger_list:
        sharded_state_dict.update(
            sharded_state_dict_default(module, f"{prefix}{name}.", sharded_offsets, metadata)
🧰 Tools
🪛 Flake8 (7.3.0)
[error] 432-432: undefined name 'sharded_state_dict_default'
(F821)
🪛 Ruff (0.15.0)
[error] 430-430: Test for object identity should be is not
Convert to is not
(E714)
[error] 430-430: Test for object identity should be is not
Convert to is not
(E714)
[error] 432-432: Undefined name sharded_state_dict_default
(F821)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py`
around lines 429 - 432, The loop uses the undefined symbol
sharded_state_dict_default and an incorrect identity comparison syntax; import
sharded_state_dict_default at the top of transformer_block.py (where other
sharded helpers are imported) and change the conditional from "if (not module is
self.layers) and (not module is self.deepstack_merger_list):" to use proper
identity checks "if (module is not self.layers) and (module is not
self.deepstack_merger_list):" so sharded_state_dict_default(module, ...)
resolves and lint errors E714/F821 are fixed.
self.linear_fc2 = build_module(
    submodules.linear_fc2,
    self.hidden_size,
    self.config.out_hidden_size,
    config=self.config,
    init_method=self.config.output_layer_init_method,
    bias=self.config.add_bias_linear,
    input_is_parallel=True,
    skip_bias_add=False,
    is_expert=False,
    tp_comm_buffer_name="patch_fc1",
    tp_group=tp_group,
)
tp_comm_buffer_name for linear_fc2 is "patch_fc1" — likely a copy-paste error.
Both linear_fc1 (line 131) and linear_fc2 (line 147) use the same buffer name "patch_fc1". The second linear should use a distinct name like "patch_fc2" to avoid communication buffer collisions.
Fix
  self.linear_fc2 = build_module(
      submodules.linear_fc2,
      self.hidden_size,
      self.config.out_hidden_size,
      config=self.config,
      init_method=self.config.output_layer_init_method,
      bias=self.config.add_bias_linear,
      input_is_parallel=True,
      skip_bias_add=False,
      is_expert=False,
-     tp_comm_buffer_name="patch_fc1",
+     tp_comm_buffer_name="patch_fc2",
      tp_group=tp_group,
  )

📝 Committable suggestion
self.linear_fc2 = build_module(
    submodules.linear_fc2,
    self.hidden_size,
    self.config.out_hidden_size,
    config=self.config,
    init_method=self.config.output_layer_init_method,
    bias=self.config.add_bias_linear,
    input_is_parallel=True,
    skip_bias_add=False,
    is_expert=False,
    tp_comm_buffer_name="patch_fc2",
    tp_group=tp_group,
)
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py` around lines
137 - 149, The tp_comm_buffer_name for the second MLP linear layer is wrong: in
the build_module call that constructs self.linear_fc2 you should change
tp_comm_buffer_name from "patch_fc1" to a distinct name (e.g. "patch_fc2") to
match the corresponding first layer (self.linear_fc1) and avoid comm buffer
collisions; update the build_module invocation for linear_fc2 so
tp_comm_buffer_name="patch_fc2".
return new_pixel_values, new_image_grid_thws, cp_img_num, images_padded


@torch.no_grad
@torch.no_grad missing parentheses — should be @torch.no_grad().
Without (), torch.no_grad receives the decorated function as its argument rather than acting as a context-managed decorator. This will likely silently produce unexpected behavior or a runtime error.
Fix
-@torch.no_grad
+@torch.no_grad()
 def qwen3vl_cp_split(

📝 Committable suggestion
@torch.no_grad()
def qwen3vl_cp_split(
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py` at line 493,
the decorator is written as `@torch.no_grad` without calling it, which passes the
function into no_grad instead of applying the context manager; update the
decorator to `@torch.no_grad()` so the no-grad context is correctly applied
(find the decorated function directly below the `@torch.no_grad` line and
replace the decorator with `@torch.no_grad()`).
def __init__(
    self,
    transformer_config: Qwen3VLTransformerConfig,
    transformer_layer_spec: ModuleSpec,
    patch_merger_spec: ModuleSpec,
    pre_process: bool = True,
    post_process: bool = True,
    pg_collection: ProcessGroupCollection = None,
) -> None:
    assert post_process and pre_process, "not support pp for deepstack_merger_list"
    super().__init__(config=transformer_config)
    self.spatial_merge_size = transformer_config.spatial_merge_size
    self.patch_size = transformer_config.patch_size
    self.spatial_merge_unit = self.spatial_merge_size * self.spatial_merge_size
    self.pg_collection = pg_collection
    self.tp_group = self.pg_collection.tp
🧩 Analysis chain
🏁 Script executed:

# First, examine the file to confirm the constructor signature
cat -n src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py | head -60

🏁 Script executed:

# Search for all instantiations of Qwen3VLVisionModel
rg -n "Qwen3VLVisionModel\s*\(" -g '*.py' -A 5

🏁 Script executed:

# Also search for how pg_collection is typically passed in this codebase
rg -n "pg_collection\s*=" -g '*.py' -B 3 -A 1 | head -100

🏁 Script executed:

sed -n '125,145p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py

🏁 Script executed:

# Check the context around line 128 in model.py to see where pg_collection comes from
sed -n '100,135p' src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py

🏁 Script executed:

# Also check the function signature that contains this call
grep -B 20 "self.vision_model = Qwen3VLVisionModel" src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py | head -30

🏁 Script executed:

# Check if there are any other calls to Qwen3VLVisionModel beyond model.py
rg -n "Qwen3VLVisionModel\(" -g '*.py' -B 2 -A 6
Validate pg_collection before dereferencing.
pg_collection defaults to None on line 42, but line 50 immediately accesses .tp without checking. If a caller instantiates Qwen3VLVisionModel without passing pg_collection, this will crash at construction time. Make the parameter required or provide a safe default.
Suggested fix
- self.pg_collection = pg_collection
- self.tp_group = self.pg_collection.tp
+ if pg_collection is None:
+ raise ValueError("pg_collection must be provided for Qwen3VLVisionModel")
+ self.pg_collection = pg_collection
+ self.tp_group = self.pg_collection.tp

🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/vision_model.py` around
lines 35 - 51, The constructor __init__ of Qwen3VLVisionModel dereferences
pg_collection.tp without checking pg_collection (pg_collection defaults to
None), so either require pg_collection be passed or validate it: update __init__
to assert or raise a clear ValueError if pg_collection is None (e.g.,
"pg_collection must be provided"), and only set self.tp_group =
self.pg_collection.tp after that check (or remove the default None from the
pg_collection parameter to make it required); ensure the validation and
assignment occur before any use of self.tp_group.
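The fail-fast pattern this prompt describes can be sketched without Megatron; `ProcessGroupCollection` and `resolve_tp_group` here are minimal stand-ins for illustration, not the real classes:

```python
class ProcessGroupCollection:
    """Minimal stand-in with the attributes the vision model reads."""

    def __init__(self, tp=None, cp=None, pp=None):
        self.tp, self.cp, self.pp = tp, cp, pp


def resolve_tp_group(pg_collection):
    """Raise a clear error up front instead of a later AttributeError on None."""
    if pg_collection is None:
        raise ValueError("pg_collection must be provided for Qwen3VLVisionModel")
    return pg_collection.tp


print(resolve_tp_group(ProcessGroupCollection(tp="tp-group-0")))  # tp-group-0

try:
    resolve_tp_group(None)
except ValueError as e:
    print(e)  # pg_collection must be provided for Qwen3VLVisionModel
```

The point of the explicit check is that the error names the missing argument at construction time, rather than surfacing as `'NoneType' object has no attribute 'tp'` deep inside the forward pass.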
# QKV mapping for vision model
ConcatenatedQKVMapping(
    megatron_param="vision_model.decoder.layers.*.self_attention.linear_qkv.weight",
    hf_param="model.visual.blocks.*.attn.qkv.weight",
),
ConcatenatedQKVMapping(
    megatron_param="vision_model.decoder.layers.*.self_attention.linear_qkv.bias",
    hf_param="model.visual.blocks.*.attn.qkv.bias",
),
ReplicatedMapping(
-   megatron_param="vision_model.**",
-   hf_param="model.visual.**",
+   megatron_param="vision_model.patch_embed.proj.**",
+   hf_param="model.visual.patch_embed.proj.**",
),
ReplicatedMapping(
    megatron_param="vision_model.pos_embed.**",
    hf_param="model.visual.pos_embed.**",
),
Mirror the vision QKV mapping in the MoE bridge.
The dense bridge now uses ConcatenatedQKVMapping for vision QKV, but the MoE bridge still replicates the entire vision module. If the vision QKV shapes differ (as implied by this change), MoE conversions will still be incorrect. Consider applying the same ConcatenatedQKVMapping entries (or refactor to share a common vision mapping block).
🤖 Prompt for AI Agents
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py` around lines 195 -
211, The MoE bridge still uses ReplicatedMapping for the vision module while the
dense bridge switched to ConcatenatedQKVMapping for vision QKV; update the MoE
bridge to mirror those same mappings (add ConcatenatedQKVMapping entries for
vision_model.decoder.layers.*.self_attention.linear_qkv.weight and .bias mapping
to model.visual.blocks.*.attn.qkv.weight and .bias, and keep ReplicatedMapping
for vision_model.patch_embed.proj.** and vision_model.pos_embed.**) or refactor
both bridges to import/share a common vision mapping block so the MoE conversion
uses the same QKV handling as the dense bridge.
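One way to keep the two bridges in sync is a shared factory for the vision mappings. The dataclasses below are simplified stand-ins for the bridge's real mapping classes, and `vision_mappings` is a hypothetical helper name:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConcatenatedQKVMapping:  # stand-in for the bridge's mapping class
    megatron_param: str
    hf_param: str


@dataclass(frozen=True)
class ReplicatedMapping:  # stand-in
    megatron_param: str
    hf_param: str


def vision_mappings():
    """Single source of truth consumed by both the dense and MoE bridges."""
    return [
        ConcatenatedQKVMapping(
            "vision_model.decoder.layers.*.self_attention.linear_qkv.weight",
            "model.visual.blocks.*.attn.qkv.weight",
        ),
        ConcatenatedQKVMapping(
            "vision_model.decoder.layers.*.self_attention.linear_qkv.bias",
            "model.visual.blocks.*.attn.qkv.bias",
        ),
        ReplicatedMapping("vision_model.patch_embed.proj.**", "model.visual.patch_embed.proj.**"),
        ReplicatedMapping("vision_model.pos_embed.**", "model.visual.pos_embed.**"),
    ]


dense_mappings = [*vision_mappings()]  # dense bridge consumes the shared block
moe_mappings = [*vision_mappings()]    # MoE bridge gets identical vision handling
print(len(dense_mappings))  # 4
```

With one factory, a future change to the vision QKV layout cannot drift between the two bridges.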
batch_size = input_ids.shape[0]

seqlens_in_batch = (
    attention_mask.sum(dim=-1, dtype=torch.int32)
    if attention_mask is not None
    else torch.ones((batch_size), dtype=torch.int32, device=input_ids.device) * input_ids.shape[1]
)
Potential NoneType indexing error when attention_mask is None and pre_process=True.
Lines 88–92 handle a None attention_mask by computing seqlens_in_batch as all-ones, but attention_mask itself remains None. When pre_process=True, lines 137 and 145 use attention_mask[i] as a boolean index into input_ids, which will raise a TypeError.
Either guard the pre-processing block against None, or materialize a boolean mask when attention_mask is None:
Proposed fix
seqlens_in_batch = (
attention_mask.sum(dim=-1, dtype=torch.int32)
if attention_mask is not None
else torch.ones((batch_size), dtype=torch.int32, device=input_ids.device) * input_ids.shape[1]
)
+ if attention_mask is None:
+ attention_mask = torch.ones(input_ids.shape[:2], dtype=torch.bool, device=input_ids.device)
tp_size = mpu.get_tensor_model_parallel_world_size()Also applies to: 130-145
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/utils/packed_seq_utils.py` around lines 86 - 92,
The code computes seqlens_in_batch when attention_mask is None but leaves
attention_mask as None, which causes a TypeError later when pre_process=True and
the code indexes attention_mask[i]; to fix, when attention_mask is None
initialize a boolean mask (e.g., attention_mask = torch.ones((batch_size,
input_ids.shape[1]), dtype=torch.bool, device=input_ids.device)) and keep
seqlens_in_batch consistent (torch.int32 sums) so subsequent uses like
attention_mask[i] and boolean indexing in the pre-processing block work
correctly; update the branch that creates seqlens_in_batch (and the analogous
130–145 region) to materialize this mask instead of leaving attention_mask as
None.
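The materialize-when-None idea, stripped down to plain Python with lists standing in for tensors (`seqlens_and_mask` is an invented name for illustration):

```python
def seqlens_and_mask(input_ids, attention_mask=None):
    """Return per-sample sequence lengths, creating a full all-ones mask
    when none is supplied so later boolean indexing never sees None."""
    batch_size, seq_len = len(input_ids), len(input_ids[0])
    if attention_mask is None:
        attention_mask = [[1] * seq_len for _ in range(batch_size)]
    seqlens_in_batch = [sum(row) for row in attention_mask]
    return seqlens_in_batch, attention_mask


# No mask: every token counts, and a usable mask comes back.
print(seqlens_and_mask([[5, 6, 7], [8, 9, 0]]))  # ([3, 3], [[1, 1, 1], [1, 1, 1]])

# Explicit mask: padded positions are excluded from the lengths.
print(seqlens_and_mask([[5, 6, 7], [8, 9, 0]], [[1, 1, 0], [1, 1, 1]])[0])  # [2, 3]
```

Returning the materialized mask alongside the lengths is what keeps the downstream `attention_mask[i]` indexing path safe in both branches.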
is_first = True
is_last = True
🧩 Analysis chain
🏁 Script executed:

#!/bin/bash
# First, let's find and examine the vlm_step.py file
find . -name "vlm_step.py" -type f

🏁 Script executed:

#!/bin/bash
# Read the relevant section of vlm_step.py (lines 120-200)
FILE="src/megatron/bridge/training/vlm_step.py"
if [ -f "$FILE" ]; then
  echo "=== Lines 120-200 of $FILE ==="
  sed -n '120,200p' "$FILE" | cat -n
else
  echo "File not found at $FILE"
  # Try to find it
  find . -name "vlm_step.py" 2>/dev/null | head -5
fi

🏁 Script executed:

#!/bin/bash
# Check for PP stage variables in the file
FILE="src/megatron/bridge/training/vlm_step.py"
if [ -f "$FILE" ]; then
  echo "=== Searching for is_pp_first_stage / is_pp_last_stage ==="
  rg "is_pp_first_stage|is_pp_last_stage" "$FILE" -n
  echo ""
  echo "=== Searching for if True ==="
  rg "if True" "$FILE" -n
fi

🏁 Script executed:

#!/bin/bash
# Check where vlm_forward_step and related functions are used
rg "vlm_forward_step" --type=py -l | head -10

🏁 Script executed:

#!/bin/bash
# Read more context around the hard-coded values and check for comments
FILE="src/megatron/bridge/training/vlm_step.py"
sed -n '100,130p' "$FILE" | cat -n

🏁 Script executed:

#!/bin/bash
# Check if there are any comments explaining the PP behavior or hard-coding
FILE="src/megatron/bridge/training/vlm_step.py"
sed -n '1,150p' "$FILE" | cat -n | tail -70

🏁 Script executed:

#!/bin/bash
# Search for any recent changes or comments about CUDA graphs or PP behavior
rg -i "cuda.*graph|cuda graph|first.*last.*pp|pipeline.*parallel" src/megatron/bridge/training/vlm_step.py -B2 -A2

🏁 Script executed:

#!/bin/bash
# Check the get_batch_from_iterator function to see how is_first and is_last are used
rg -n "def get_batch_from_iterator" --type=py

🏁 Script executed:

#!/bin/bash
# Check if there's a git log showing when these values changed from dynamic to hard-coded
if [ -d ".git" ]; then
  git log --oneline -20 -- src/megatron/bridge/training/vlm_step.py
fi

🏁 Script executed:

#!/bin/bash
# Check what the else branch (lines 159-192) contains and verify indentation
FILE="src/megatron/bridge/training/vlm_step.py"
sed -n '145,195p' "$FILE" | cat -n

🏁 Script executed:

#!/bin/bash
# Check the full context of get_batch_from_iterator to understand what is_first/is_last control
FILE="src/megatron/bridge/training/vlm_step.py"
sed -n '41,120p' "$FILE" | cat -n
Remove dead code (`if True:`) and document the CUDA graph requirement for hard-coded PP stage values.
Lines 124–125 hard-code `is_first = True` and `is_last = True`, forcing all PP ranks to load `labels` and `loss_mask` from the data iterator (see `get_batch_from_iterator`, lines 36–37). The `if True:` at line 145 makes the else branch (lines 160–207) unreachable dead code; per coding guidelines, remove this dead branch and replace `if True:` with a proper condition, or remove the conditional entirely.
If this hard-coding is intentional for CUDA graph support (per commit "by pass encoder for cuda graph"), add a comment explaining why all PP stages must load `labels`/`loss_mask` even though only the last PP stage needs them for loss computation. This clarifies the behavioral change for PP > 1 configurations.
🤖 Prompt for AI Agents
In `@src/megatron/bridge/training/vlm_step.py` around lines 124–125, the code currently hard-sets `is_first = True` and `is_last = True` and contains an unconditional `if True:` that makes the else branch unreachable. Update the VLM step so it either
(a) restores correct per-rank `is_first`/`is_last` logic (remove the hard-coded assignments and compute them from the pipeline-parallel rank and size) and removes the `if True:` so the else branch is used, or
(b) if the hard-coding was intentional for CUDA graph support, keeps the behavior but removes the dead else branch and adds a clear comment above the assignments explaining: why all PP ranks must load labels/loss_mask (reference `get_batch_from_iterator`, where labels/loss_mask are loaded), why this bypasses the encoder for CUDA graphs, and which specific condition (e.g., `CUDA_GRAPH_ENABLED`) justifies the behavior.
Either way, replace the former `if True:` with a meaningful check or delete it, keeping only the active path so there is no unreachable code.
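Option (a) above — deriving the stage flags from the pipeline-parallel topology instead of hard-coding them — could look like the following minimal sketch. This uses a plain-Python stand-in helper; in Megatron these flags would normally come from `megatron.core.parallel_state.is_pipeline_first_stage()` / `is_pipeline_last_stage()` rather than a hand-rolled function:

```python
def pp_stage_flags(pp_rank: int, pp_world_size: int) -> tuple[bool, bool]:
    """Return (is_first, is_last) for one pipeline-parallel rank.

    Stand-in for the megatron.core.parallel_state helpers. With
    pp_world_size == 1 both flags are True, which matches the current
    hard-coded behavior; with PP > 1 only the last stage loads
    labels/loss_mask for loss computation.
    """
    return pp_rank == 0, pp_rank == pp_world_size - 1


# Example: with no pipeline parallelism, every rank is both first and last.
is_first, is_last = pp_stage_flags(pp_rank=0, pp_world_size=1)
```

If the CUDA-graph path genuinely requires all stages to load the tensors, the helper can still be called and its result deliberately overridden under a named flag, which keeps the intent readable.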
Force-pushed 1be02af to 4c4ef34
What does this PR do?
This PR is used with MCore PR 3137 to pass pg_collection to CudaGraphManager and TECudaGraphHelper.
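The wiring being described can be sketched roughly as follows. `ProcessGroupCollection` and `CudaGraphManagerSketch` here are hypothetical stand-ins for illustration only — the actual MCore `CudaGraphManager`/`TECudaGraphHelper` signatures in PR 3137 may differ:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ProcessGroupCollection:
    """Hypothetical container for named process groups (tp, pp, dp, ...)."""

    groups: dict[str, Any] = field(default_factory=dict)

    def get(self, name: str) -> Any:
        return self.groups.get(name)


class CudaGraphManagerSketch:
    """Sketch of a graph-capture helper that accepts an explicit
    pg_collection instead of reading global parallel state, so that
    decentralized jobs can supply their own process groups."""

    def __init__(self, pg_collection: Optional[ProcessGroupCollection] = None):
        # None keeps backward compatibility: fall back to an empty collection
        # here (a real implementation would fall back to the global default
        # process groups instead).
        self.pg_collection = pg_collection or ProcessGroupCollection()
```

The design point is simply that the process groups become an explicit constructor argument with a backward-compatible default, rather than an implicit global.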
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Enhancements
Documentation