
[bridge, model] Qwen 3.5 VL Bridge#2530

Merged
cuichenx merged 15 commits into main from chcui/qwen3_5
Feb 28, 2026

Conversation

Contributor

@cuichenx cuichenx commented Feb 25, 2026

What does this PR do?

Add Megatron Bridge support for Qwen3.5 Vision-Language models, enabling bidirectional HF ↔ Megatron checkpoint conversion for both the dense and MoE variants.

Changelog

  • Add Qwen35VLBridge (dense, e.g. Qwen3.5-27B) and Qwen35VLMoEBridge (MoE, e.g. Qwen3.5-397B-A17B) with full weight mapping registries
  • Add Qwen35VLModelProvider and Qwen35VLMoEModelProvider for Megatron-Core model instantiation with hybrid Gated DeltaNet + Gated Attention architecture
  • Add GDNLinearMappingSeparate param mapping to handle Qwen3.5's separate QKV/Z/B/A projection layout (vs. Qwen3-Next's fused layout), plus _fuse_gdn_separate_to_grouped / _split_gdn_grouped_to_separate helpers
  • Add example conversion and Slurm inference scripts for Qwen3.5 VL
  • Minor fixes to compare.py roundtrip comparison script
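
The separate-layout handling in the new mapping can be pictured with a toy round-trip. The function names echo the helpers above, but the layout shown is illustrative only; the real grouped format and the z/b/a handling in _fuse_gdn_separate_to_grouped are more involved, and row lists stand in for 2-D weight tensors.

```python
# Toy sketch: fuse separate GDN projection weights into a per-head grouped
# layout and invert it. Head counts and the interleave order are assumptions
# for illustration, not the actual Qwen3-Next fused in_proj layout.

def fuse_gdn_separate_to_grouped(qkv_rows, z_rows, num_key_heads):
    """Interleave per-key-head slices of the separate weights into one tensor."""
    qkv_per_head = len(qkv_rows) // num_key_heads
    z_per_head = len(z_rows) // num_key_heads
    grouped = []
    for h in range(num_key_heads):
        grouped += qkv_rows[h * qkv_per_head:(h + 1) * qkv_per_head]
        grouped += z_rows[h * z_per_head:(h + 1) * z_per_head]
    return grouped

def split_gdn_grouped_to_separate(grouped, num_key_heads, qkv_per_head, z_per_head):
    """Inverse of the fuse step: recover the separate qkv and z weights."""
    qkv_rows, z_rows = [], []
    stride = qkv_per_head + z_per_head
    for h in range(num_key_heads):
        block = grouped[h * stride:(h + 1) * stride]
        qkv_rows += block[:qkv_per_head]
        z_rows += block[qkv_per_head:]
    return qkv_rows, z_rows
```

The point is that the HF → Megatron direction fuses per-head slices into one tensor, and the export direction must undo exactly that interleave to round-trip checkpoints losslessly.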

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Extends the existing Qwen VL bridge family (Qwen25VLBridge, Qwen3VLBridge, Qwen3VLMoEBridge) to the new Qwen3.5 generation
  • Qwen3.5 uses a hybrid GDN + Attention architecture (3 GDN layers per 1 Attention layer) with either dense MLP or MoE with shared experts
  • Requires transformers >= 5.2.0 for MoE config class; dense config is guarded behind a separate import check
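
The version-dependent import guard mentioned above typically looks like a try/except import check. The class name below is a placeholder for illustration; the real symbol and the separate dense-config guard live in the bridge module.

```python
# Hypothetical sketch of an optional-import guard; Qwen3_5MoeConfig is an
# assumed name, not a verified transformers symbol.
try:
    from transformers import Qwen3_5MoeConfig  # needs transformers >= 5.2.0
    HAVE_QWEN35_MOE_CONFIG = True
except ImportError:
    Qwen3_5MoeConfig = None
    HAVE_QWEN35_MOE_CONFIG = False
```

Bridges that need the MoE config can then check the flag and raise a clear error (or skip registration) when the installed transformers is too old.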

Summary by CodeRabbit

  • New Features

    • Added Qwen3.5 Vision-Language model support (dense and MoE variants)
    • Added conversion workflows and inference scripts for Qwen3.5-VL models, including SLURM multi-node support
    • Added support for gated attention mechanisms in hybrid architectures
  • Bug Fixes

    • Fixed handling of tuple outputs during model conversion
    • Improved precision consistency during cross-rank broadcasts in model conversion
    • Enhanced parameter comparison tolerance for specific model components

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot bot commented Feb 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cuichenx cuichenx changed the title from "[WIP] [model] Qwen 3.5 VL" to "[WIP] [model] Qwen 3.5 VL Bridge" Feb 26, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx changed the title from "[WIP] [model] Qwen 3.5 VL Bridge" to "[model] Qwen 3.5 VL Bridge" Feb 26, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx changed the title from "[model] Qwen 3.5 VL Bridge" to "[bridge, model] Qwen 3.5 VL Bridge" Feb 27, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as ready for review February 27, 2026 01:22
@cuichenx cuichenx marked this pull request as draft February 27, 2026 01:22
Contributor

coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

Walkthrough

This pull request adds comprehensive support for Qwen3.5 Vision-Language models in both dense and MoE variants. It introduces new bridge and provider implementations, parameter mapping utilities for GDN components, example conversion and inference workflows via bash scripts, and refinements to existing comparison and attention mechanisms.

Changes

Cohort / File(s) Summary
Qwen3.5 VL Bridge & Provider Implementations
src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py, src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
New bridge classes (Qwen35VLBridge, Qwen35VLMoEBridge) with comprehensive parameter mapping registries for dense and MoE architectures; corresponding provider classes (Qwen35VLModelProvider, Qwen35VLMoEModelProvider) for model instantiation with vision encoder and hybrid GDN/gated-attention setup.
Module Exports & VL Model Infrastructure
src/megatron/bridge/models/__init__.py, src/megatron/bridge/models/qwen_vl/__init__.py
Extended public API to export Qwen35VLBridge, Qwen35VLModelProvider, Qwen35VLMoEBridge, and Qwen35VLMoEModelProvider.
Parameter Mapping for GDN Separate Components
src/megatron/bridge/models/conversion/param_mapping.py
Added GDNLinearMappingSeparate class and two helper functions (_fuse_gdn_separate_to_grouped, _split_gdn_grouped_to_separate) to handle separate HF tensor handling for QKV, Z, B, A weight matrices in Qwen3.5 models.
Qwen3 VL Model Attention & Configuration
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py, src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py
Added optional output gate path in attention forward; guarded attention module assignment with submodules check; added trivial init to ExpertMLPDownProjMapping.
Conversion & Roundtrip Comparison Scripts
examples/conversion/compare_hf_and_megatron/compare.py, examples/conversion/hf_megatron_roundtrip_multi_gpu.py
Improved tuple output handling in vlm_forward_step, added MTP disabling workaround, enhanced garbage collection/CUDA cache clearing, explicit dtype normalization for broadcasts, restricted distributed logging visibility; added precision ignore parameters for A_log and linear_attn.norm.weight.
Qwen3.5 Conversion & Inference Workflows
examples/models/vlm/qwen35_vl/conversion.sh, examples/models/vlm/qwen35_vl/inference.sh, examples/models/vlm/qwen35_vl/slurm_inference.sh
New bash scripts orchestrating model conversion/validation (supporting multiple Qwen variants with dtype/model-class selection), multi-mode inference (HF, Megatron, exported HF), and multi-node SLURM-based inference with parallelism setup, containerization support, and environment optimization.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • erhoo82
  • ko3n1g
  • yaoyu-33
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 77.42%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Test Results For Major Changes (⚠️ Warning): the PR description states 'tests and docs updates remain incomplete,' so no test results are documented for these major bridge additions. Resolution: include test execution results from the conversion validation scripts and document numerical correctness verification for both dense and MoE variants.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title '[bridge, model] Qwen 3.5 VL Bridge' clearly summarizes the main change: adding Megatron Bridge support for Qwen 3.5 Vision-Language models with bridge and model provider implementations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (6)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)

87-88: Add documentation clarifying the conditional is intentional for pre-patched specs.

The guard correctly handles two legitimate patterns:

  • Qwen3 VL (standard/MoE) passes specs with .submodules structure from get_gpt_layer_with_transformer_engine_spec(), which get patched here
  • Qwen3.5 VL pre-patches via _patch_standard_attention_specs() before passing TransformerBlockSubmodules (which lacks the expected .submodules structure), so the conditional correctly skips redundant patching

All current call paths work correctly and required patching is not silently skipped. However, add a docstring or inline comment explaining that the hasattr check distinguishes between standard ModuleSpec (needs patching) and pre-patched TransformerBlockSubmodules (skips patching). This prevents future contributors from misinterpreting the conditional as a bug.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py` around lines
87 - 88, Add an inline comment or short docstring next to the conditional that
checks hasattr(language_transformer_layer_spec, "submodules") to explain it is
intentional: ModuleSpecs coming from
get_gpt_layer_with_transformer_engine_spec() contain a .submodules structure and
need the patch that assigns Qwen3VLSelfAttention, whereas
TransformerBlockSubmodules are pre-patched by _patch_standard_attention_specs()
(used by Qwen3.5 VL) and therefore should be skipped; reference both callers
(get_gpt_layer_with_transformer_engine_spec and _patch_standard_attention_specs)
and the Qwen3VLSelfAttention assignment to make the intent clear for future
readers.
examples/models/vlm/qwen35_vl/inference.sh (1)

15-43: Consider adding shell strict mode for safer execution.

Similar to the conversion script, adding set -euo pipefail would help catch failures in the inference pipeline.

♻️ Proposed fix
 # Workspace directory for checkpoints and results
+set -euo pipefail
+
 WORKSPACE=${WORKSPACE:-/workspace}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/vlm/qwen35_vl/inference.sh` around lines 15 - 43, The script
lacks shell strict mode; add a strict-mode line (set -euo pipefail) near the top
of the inference.sh (before WORKSPACE and MODEL_NAME definitions) so the shell
exits on errors, treats unset variables as errors, and fails pipelines properly;
ensure this change is applied to the top of
examples/models/vlm/qwen35_vl/inference.sh so all three uv run blocks benefit
from the stricter error handling.
examples/models/vlm/qwen35_vl/conversion.sh (1)

16-52: Consider adding shell strict mode for safer script execution.

Per Google Shell Style Guide, adding set -euo pipefail at the top helps catch errors early. This prevents the script from silently continuing if a conversion or validation step fails.

♻️ Proposed fix
 # Workspace directory for checkpoints and results
+set -euo pipefail
+
 WORKSPACE=${WORKSPACE:-/workspace}
 MODEL_NAME=Qwen3.5-35B-A3B # Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen3.5-27B
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/vlm/qwen35_vl/conversion.sh` around lines 16 - 52, Add shell
strict mode to the script by enabling "set -euo pipefail" at the top (before the
WORKSPACE/MODEL_NAME/HF_MODEL_CLASS assignments and any uv/python commands) so
the script exits on errors, treats unset variables as failures, and fails on
pipe errors; ensure this is the very first executable line so all subsequent
operations (the uv run convert_checkpoints.py import/export, compare.py
validation, and hf_megatron_roundtrip_multi_gpu.py calls) inherit the stricter
error handling.
examples/models/vlm/qwen35_vl/slurm_inference.sh (1)

146-147: Consider making sync wait time configurable or using a more robust synchronization method.

The hardcoded sleep 5 may be insufficient for large clusters with slow shared filesystems. Consider using a file-based barrier or making the wait time configurable.

♻️ Proposed improvement
+# Sync wait time (adjust for slow shared filesystems)
+SYNC_WAIT_SECONDS=${SYNC_WAIT_SECONDS:-10}
+
 # Only rank 0 on each node runs uv sync
-SYNC_CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 5; fi"
+SYNC_CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep \$SYNC_WAIT_SECONDS; fi"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/vlm/qwen35_vl/slurm_inference.sh` around lines 146 - 147, The
sync logic uses a hardcoded sleep ("SYNC_CMD" and "FULL_CMD") which can fail on
slow clusters; modify SYNC_CMD to use a configurable wait (e.g., read
WAIT_SECONDS from env with a sensible default) or replace the sleep-based gate
with a file-based barrier: have the rank 0 node run "uv sync" then create a
marker file (e.g., /tmp/uv_sync_done_${SLURM_JOB_ID}) and have other ranks poll
for that file with a timeout; update references to SYNC_CMD and FULL_CMD
accordingly and ensure polling/backoff and a configurable timeout are used
instead of fixed "sleep 5".
src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py (1)

218-275: Consider extracting shared param_mappings to reduce duplication.

The param_mappings dictionaries in Qwen35VLMoEBridge.mapping_registry and Qwen35VLBridge.mapping_registry share significant overlap (vision model mappings, embeddings, GDN mappings, etc.). Consider extracting common mappings to a shared constant or helper method.

This would reduce maintenance burden when vision model mappings change. However, keeping them separate also has merit for clarity and allowing variant-specific divergence, so this is optional.

Also applies to: 489-542

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py` around lines 218 -
275, Duplicate param_mappings exist between Qwen35VLMoEBridge.mapping_registry
and Qwen35VLBridge.mapping_registry; extract the shared entries (e.g., vision
model mappings, embeddings, GDN mappings) into a single module-level constant
like COMMON_PARAM_MAPPINGS and then compose each class's mapping_registry by
copying/unpacking COMMON_PARAM_MAPPINGS and applying bridge-specific overrides
(preserving keys used in Qwen35VLMoEBridge.mapping_registry and
Qwen35VLBridge.mapping_registry); update both mapping_registry definitions to
reference the shared constant (keep variant-specific differences after
unpacking) and run tests for correctness (also apply same extraction to the
duplicate block at the other location referenced around lines 489-542).
src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py (1)

35-35: Consider using modern union syntax for type hints.

Per coding guidelines, use T | None instead of Optional[T] for nullable types.

♻️ Proposed fix
-from typing import Any, Callable, List, Optional
+from typing import Any, Callable, List

Then update usages:

-    mtp_num_layers: Optional[int] = None
+    mtp_num_layers: int | None = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py` at line 35, Replace
usages of Optional[...] with the modern union syntax T | None across this module
and remove Optional from the import list; specifically update the import line
that currently includes "Optional" (the same line importing Any, Callable, List,
Optional) to drop Optional and refactor all type annotations that use
Optional[...] to use `| None` (e.g., change `FooOptionalType: Optional[Bar]` to
`FooOptionalType: Bar | None`). Ensure function signatures and class attributes
inside qwen35_vl_provider.py that reference Optional are updated to the new
syntax and that no lingering Optional imports remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/conversion/compare_hf_and_megatron/compare.py`:
- Around line 322-328: The tuple handling in compare.py around model_output is
assuming exactly two elements which can raise if arity differs; change the
branch that handles tuple outputs to always take the first element (e.g.,
output_tensor = model_output[0]) instead of unpacking two values, keeping the
else branch that assigns output_tensor = model_output for non-tuples and
returning output_tensor, loss_func; use the existing symbols model,
forward_args, model_output, output_tensor, and loss_func to locate and update
the logic.
- Around line 619-622: The loop that disables MTP for inference blindly sets
m.config.mtp_num_layers and m.config.grad_scale_func and will crash if those
attributes are absent; update the code in the megatron_model iteration to check
for the attributes (e.g., using hasattr or getattr) on m.config before assigning
and only set them to None when they exist (references: megatron_model,
m.config.mtp_num_layers, m.config.grad_scale_func).

In `@examples/models/vlm/qwen35_vl/inference.sh`:
- Line 17: Replace the hardcoded WORKSPACE assignment with a shell
parameter-expansion default so the script uses the caller's WORKSPACE if set;
specifically change the WORKSPACE variable assignment (symbol WORKSPACE) to use
${WORKSPACE:-/workspace} (and append /models/Qwen as needed) instead of the
fixed /chcui/mbridge_home/models/Qwen path so other users won’t be forced to the
developer’s local location.

In `@examples/models/vlm/qwen35_vl/slurm_inference.sh`:
- Around line 119-128: The GPU validation uses an incorrect formula; replace the
max(TP, EP)×PP logic by computing REQUIRED_GPUS = TP × PP × CP × DP (introduce
CP and DP variables), ensure DP is documented/derived as DP = EP × EDP if
applicable, and update the validation that compares REQUIRED_GPUS to TOTAL_GPUS;
adjust the echo messages to report TP, PP, CP, DP, EP and the relationship
between EP and DP so the check and diagnostic output in the slurm_inference.sh
script reference the new variables (TP, EP, PP, CP, DP, REQUIRED_GPUS,
TOTAL_GPUS).

In `@src/megatron/bridge/models/conversion/param_mapping.py`:
- Around line 1823-1826: The docstring bullet lines for the projection names
(symbols: in_proj_qkv, in_proj_z, in_proj_b, in_proj_a) use EN DASH characters
(–) which triggers RUF002; edit the docstring in
src/megatron/bridge/models/conversion/param_mapping.py and replace each EN DASH
with a standard ASCII hyphen-minus (-) in those bullet lines (and any other
bullets nearby) so the linter no longer flags ambiguous dash characters.
- Around line 2537-2582: The helper _fuse_gdn_separate_to_grouped assumes exact
flat shapes and currently lets torch.split/reshape raise cryptic errors; add
explicit shape validation at the top of _fuse_gdn_separate_to_grouped (and the
symmetric helper referenced around lines 2585-2623, e.g.
_fuse_gdn_grouped_to_separate) that computes qk_dim = config.linear_key_head_dim
* config.linear_num_key_heads, v_dim = config.linear_value_head_dim *
config.linear_num_value_heads and v_per_group = config.linear_num_value_heads //
config.linear_num_key_heads and then checks: qkv.shape == (qk_dim*2 + v_dim,
hidden_size), z.shape == (v_dim, hidden_size), b.shape ==
(config.linear_num_value_heads, hidden_size) or (config.linear_num_key_heads,
v_per_group, hidden_size) as appropriate, and a.shape similarly; if a check
fails raise a ValueError with a clear message naming the parameter (qkv/z/b/a),
the expected shape derived from config (list the dims) and the actual shape so
conversion fails fast and actionable before any torch.reshape/split is
attempted.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py`:
- Around line 105-111: The flash-decode early-return path skips applying the
attention output gate when self.config.attention_output_gate is true, causing
divergent decode behavior; update the flash-decode branch in the attention
forward code to call get_query_key_value_tensors(..., output_gate=True) when
attention_output_gate is enabled, capture the returned gate, and apply that same
gating logic to the flash-decode output before returning (mirror the gating used
in the non-early-return path so both paths use the gate variable returned by
get_query_key_value_tensors).

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py`:
- Line 174: Rename the misspelled attribute name hetereogenous_dist_checkpoint
to heterogeneous_dist_checkpoint in qwen35_vl_provider.py and update all
references/usages to that symbol (e.g., any assignments, checks, or constructor
params that use hetereogenous_dist_checkpoint) so the code compiles and keeps
consistent naming across the module.
- Line 338: The variable name hetereogenous_dist_checkpoint is misspelled;
rename the symbol to heterogeneous_dist_checkpoint across the module (definition
and all references) so all assignments, argument names, and uses (e.g., the
variable in qwen35_vl_provider.py and any functions or classes that accept it)
consistently use the corrected spelling and update any related type hints or
docstrings accordingly.
- Around line 420-443: In _patch_standard_attention_specs, the
issubclass(attn_spec.module, SelfAttention) call can raise TypeError when
attn_spec.module is not a class; change the check to safely determine class-ness
before using issubclass (e.g., test isinstance(attn_spec.module, type) and then
issubclass, or wrap issubclass in a try/except TypeError), and ensure you still
only patch modules that both qualify as SelfAttention subclasses and expose the
linear_qkv submodule (use attn_spec.module and attn_spec.submodules.linear_qkv
as guards) so GDN-like modules remain unchanged.
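
The safe class check asked for above can be sketched like this; SelfAttention here is a stand-in for the real Megatron-Core class, and the helper name is illustrative.

```python
from types import SimpleNamespace

class SelfAttention:  # stand-in for megatron.core's SelfAttention
    pass

def is_self_attention_spec(attn_spec, base_cls=SelfAttention):
    # issubclass() raises TypeError when its first argument is not a class,
    # so verify class-ness with isinstance(..., type) before calling it.
    module = getattr(attn_spec, "module", None)
    return isinstance(module, type) and issubclass(module, base_cls)
```

GDN-like specs whose module is a factory function (or missing entirely) then simply return False instead of crashing the patching loop.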


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3e44b4 and a2f99dc.

📒 Files selected for processing (13)
  • examples/conversion/compare_hf_and_megatron/compare.py
  • examples/conversion/hf_megatron_roundtrip_multi_gpu.py
  • examples/models/vlm/qwen35_vl/conversion.sh
  • examples/models/vlm/qwen35_vl/inference.sh
  • examples/models/vlm/qwen35_vl/slurm_inference.sh
  • src/megatron/bridge/models/__init__.py
  • src/megatron/bridge/models/conversion/param_mapping.py
  • src/megatron/bridge/models/qwen_vl/__init__.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
  • src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py

Comment on lines +619 to +622

# Workaround: disable MTP for inference (causes hangs on NCCL collectives)
for m in megatron_model:
    m.config.mtp_num_layers = None
    m.config.grad_scale_func = None

@coderabbitai coderabbitai bot Feb 27, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /tmp/repo && head -650 examples/conversion/compare_hf_and_megatron/compare.py | tail -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 130


🏁 Script executed:

# Get more context around the problematic lines
cd /tmp/repo && sed -n '600,630p' examples/conversion/compare_hf_and_megatron/compare.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 130


🏁 Script executed:

# Search for mtp_num_layers and grad_scale_func definitions and usages
rg -n "mtp_num_layers|grad_scale_func" --type=py src examples

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4815


🏁 Script executed:

# Check what megatron_model is - search backwards from line 619 to find its assignment
cd /tmp/repo && sed -n '1,625p' examples/conversion/compare_hf_and_megatron/compare.py | grep -n "megatron_model\s*=" | tail -5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 130


🏁 Script executed:

# Look for config class definitions in megatron/core to understand structure
fd "config" src/megatron --type f -name "*.py" | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 301


🏁 Script executed:

sed -n '580,635p' examples/conversion/compare_hf_and_megatron/compare.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2450


🏁 Script executed:

# Check the vlm/base.py file that does the same workaround
sed -n '85,105p' src/megatron/bridge/inference/vlm/base.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 669


🏁 Script executed:

# Search for how megatron_model is populated before line 619
rg -n "megatron_model\s*=|megatron_model\.append|megatron_model\[" examples/conversion/compare_hf_and_megatron/compare.py -A 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 469


🏁 Script executed:

# Check if TransformerConfig or other config classes define these fields
rg -n "class.*Config|mtp_num_layers.*=|grad_scale_func.*=" src/megatron/bridge/models -A 2 -B 2 | grep -A 3 -B 3 "class\|mtp_num_layers\|grad_scale_func"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 15938


Guard the MTP workaround with attribute checks.

The code unconditionally assigns to m.config.mtp_num_layers and m.config.grad_scale_func without verifying they exist. These fields are optional across different config types (as shown by Optional[int] definitions in provider configs and defensive getattr() patterns elsewhere in the codebase). Configs without these attributes will fail during this workaround.

Proposed fix
    for m in megatron_model:
-        m.config.mtp_num_layers = None
-        m.config.grad_scale_func = None
+        if hasattr(m.config, "mtp_num_layers"):
+            m.config.mtp_num_layers = None
+        if hasattr(m.config, "grad_scale_func"):
+            m.config.grad_scale_func = None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 # Workaround: disable MTP for inference (causes hangs on NCCL collectives)
 for m in megatron_model:
-    m.config.mtp_num_layers = None
-    m.config.grad_scale_func = None
+    if hasattr(m.config, "mtp_num_layers"):
+        m.config.mtp_num_layers = None
+    if hasattr(m.config, "grad_scale_func"):
+        m.config.grad_scale_func = None

Contributor Author:

Setting them to None has no issues.
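The author's point holds for ordinary Python objects: assigning a previously absent attribute simply creates it. A quick sketch with a stand-in `Config` class (not the real Megatron config):

```python
class Config:
    """Stand-in for m.config; deliberately defines neither attribute."""


cfg = Config()

# Both assignments succeed even though the attributes did not exist before:
cfg.mtp_num_layers = None
cfg.grad_scale_func = None
```

Downstream code reading these attributes with plain access or `getattr` then sees `None`, which is what the workaround relies on.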

Contributor:

@cuichenx, understood! I've noted that setting these config attributes to None works without issues in this context. Thanks for clarifying.


✏️ Learnings added
Learnt from: cuichenx
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2530
File: examples/conversion/compare_hf_and_megatron/compare.py:619-622
Timestamp: 2026-02-27T02:48:46.395Z
Learning: In `examples/conversion/compare_hf_and_megatron/compare.py`, when disabling MTP for inference by setting `m.config.mtp_num_layers = None` and `m.config.grad_scale_func = None` on megatron_model components, no attribute existence checks are needed—setting these to None works without issues even if the attributes don't initially exist.



🧠 Learnings used
Learnt from: HollowMan6
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2387
File: src/megatron/bridge/models/mimo/mimo_bridge.py:49-113
Timestamp: 2026-02-14T16:32:37.992Z
Learning: In `src/megatron/bridge/models/mimo/mimo_bridge.py`, the mapping_registry method intentionally iterates over both "transformer_layer" and "mtp_model_layer" prefixes for MTP layer mappings. This dual-prefix support ensures compatibility because Megatron-Core may expose the same MTP submodule under either naming convention, allowing HF MTP weights to load correctly without requiring configuration overrides.

Learnt from: aroshanghias-nvd
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2040
File: src/megatron/bridge/models/mimo/llava_provider.py:109-113
Timestamp: 2026-01-28T17:39:49.959Z
Learning: In `src/megatron/bridge/models/mimo/llava_provider.py`, the `TransformerConfig` used for the `MultimodalProjector` includes `num_attention_heads=1` as a required placeholder. While the MLP projector doesn't use attention, `TransformerConfig` requires a positive value for `num_attention_heads` (default of 0 is not valid), so using 1 as a placeholder is the correct approach.

Learnt from: aroshanghias-nvd
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2007
File: src/megatron/bridge/data/mimo/loaders.py:0-0
Timestamp: 2026-01-28T20:36:48.499Z
Learning: In MIMO data loaders (`src/megatron/bridge/data/mimo/loaders.py`), use `cfg.train.micro_batch_size` directly instead of calculating `cfg.train.global_batch_size // dp_size`, as the latter ignores gradient accumulation steps. The correct relationship is `global_batch_size = micro_batch_size × dp_size × num_micro_batches`. This matches the standard pattern used in Bridge's non-MIMO data loaders.

Learnt from: CR
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-01-28T19:56:04.048Z
Learning: Applies to src/megatron/bridge/models/**/*.py : Handle tensor parallel and pipeline parallel distribution correctly in weight conversion

Learnt from: yaoyu-33
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2342
File: src/megatron/bridge/models/qwen_omni/thinker_model.py:326-345
Timestamp: 2026-02-13T00:22:54.549Z
Learning: In Python implementations (e.g., src/megatron/bridge/models/qwen_omni/thinker_model.py), when a feature is not supported (such as audio embeddings), raise an explicit error (e.g., NotImplementedError) instead of silently ignoring the input to fail fast with a clear message. Use a descriptive exception message that identifies the unsupported feature and the expected behavior.

Contributor:

Hmm, this breaks MTP in the Mamba model. It looks like GPT and Mamba use different logic to determine whether MTP is needed: for Mamba, MTP layers are still called even when mtp_num_layers=None (here), and the MTP forward pass will then crash (here).
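A hypothetical sketch of the divergence described above (not the real Megatron-Core code): one model gates MTP on the config value, the other on a structural property fixed at build time, so nulling the config disables MTP for the first but not the second.

```python
class GPTLike:
    def __init__(self, mtp_num_layers):
        self.mtp_num_layers = mtp_num_layers

    def forward(self):
        if self.mtp_num_layers:  # config-gated: None cleanly skips MTP
            return list(range(self.mtp_num_layers))
        return []


class MambaLike:
    def __init__(self, mtp_num_layers):
        self.mtp_num_layers = mtp_num_layers
        self.built_with_mtp = True  # decided at construction, not per call

    def forward(self):
        if self.built_with_mtp:  # structural gate: ignores the None config
            return list(range(self.mtp_num_layers))  # TypeError when None
        return []


gpt = GPTLike(mtp_num_layers=None)
gpt_out = gpt.forward()  # [] — MTP disabled as intended

mamba = MambaLike(mtp_num_layers=None)
try:
    mamba.forward()
    crashed = False
except TypeError:  # range(None) raises, mirroring the reported crash
    crashed = True
```

Under this reading, a config-only workaround is safe for GPT-style models but needs a model-type check (or a structural flag) before it can be applied to Mamba.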

Contributor:

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx (Contributor Author):

/ok to test 1a17945

@cuichenx cuichenx marked this pull request as ready for review February 27, 2026 02:45
@cuichenx cuichenx requested a review from a team as a code owner February 27, 2026 02:45
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx merged commit 872c6ec into main Feb 28, 2026
65 of 67 checks passed
@cuichenx cuichenx deleted the chcui/qwen3_5 branch February 28, 2026 00:46
copy-pr-bot bot pushed a commit that referenced this pull request Mar 19, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>
5 participants