
[bridge, model] Qwen 3.5 VL Bridge#2530

Merged
cuichenx merged 15 commits into main from chcui/qwen3_5
Feb 28, 2026

Conversation

Contributor

@cuichenx cuichenx commented Feb 25, 2026

What does this PR do?

Add Megatron Bridge support for Qwen3.5 Vision-Language models, enabling bidirectional HF ↔ Megatron checkpoint conversion for both the dense and MoE variants.

Changelog

  • Add Qwen35VLBridge (dense, e.g. Qwen3.5-27B) and Qwen35VLMoEBridge (MoE, e.g. Qwen3.5-397B-A17B) with full weight mapping registries
  • Add Qwen35VLModelProvider and Qwen35VLMoEModelProvider for Megatron-Core model instantiation with hybrid Gated DeltaNet + Gated Attention architecture
  • Add GDNLinearMappingSeparate param mapping to handle Qwen3.5's separate QKV/Z/B/A projection layout (vs. Qwen3-Next's fused layout), plus _fuse_gdn_separate_to_grouped / _split_gdn_grouped_to_separate helpers
  • Add example conversion and Slurm inference scripts for Qwen3.5 VL
  • Minor fixes to compare.py roundtrip comparison script
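
The separate-layout handling in the new mapping can be pictured with a toy round-trip. The function names echo the helpers above, but the layout shown is illustrative only; the real grouped format and the z/b/a handling in _fuse_gdn_separate_to_grouped are more involved, and row lists stand in for 2-D weight tensors.

```python
# Toy sketch: fuse separate GDN projection weights into a per-head grouped
# layout and invert it. Head counts and the interleave order are assumptions
# for illustration, not the actual Qwen3-Next fused in_proj layout.

def fuse_gdn_separate_to_grouped(qkv_rows, z_rows, num_key_heads):
    """Interleave per-key-head slices of the separate weights into one tensor."""
    qkv_per_head = len(qkv_rows) // num_key_heads
    z_per_head = len(z_rows) // num_key_heads
    grouped = []
    for h in range(num_key_heads):
        grouped += qkv_rows[h * qkv_per_head:(h + 1) * qkv_per_head]
        grouped += z_rows[h * z_per_head:(h + 1) * z_per_head]
    return grouped

def split_gdn_grouped_to_separate(grouped, num_key_heads, qkv_per_head, z_per_head):
    """Inverse of the fuse step: recover the separate qkv and z weights."""
    qkv_rows, z_rows = [], []
    stride = qkv_per_head + z_per_head
    for h in range(num_key_heads):
        block = grouped[h * stride:(h + 1) * stride]
        qkv_rows += block[:qkv_per_head]
        z_rows += block[qkv_per_head:]
    return qkv_rows, z_rows
```

The point is that the HF → Megatron direction fuses per-head slices into one tensor, and the export direction must undo exactly that interleave to round-trip checkpoints losslessly.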

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Extends the existing Qwen VL bridge family (Qwen25VLBridge, Qwen3VLBridge, Qwen3VLMoEBridge) to the new Qwen3.5 generation
  • Qwen3.5 uses a hybrid GDN + Attention architecture (3 GDN layers per 1 Attention layer) with either dense MLP or MoE with shared experts
  • Requires transformers >= 5.2.0 for MoE config class; dense config is guarded behind a separate import check
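
The version-dependent import guard mentioned above typically looks like a try/except import check. The class name below is a placeholder for illustration; the real symbol and the separate dense-config guard live in the bridge module.

```python
# Hypothetical sketch of an optional-import guard; Qwen3_5MoeConfig is an
# assumed name, not a verified transformers symbol.
try:
    from transformers import Qwen3_5MoeConfig  # needs transformers >= 5.2.0
    HAVE_QWEN35_MOE_CONFIG = True
except ImportError:
    Qwen3_5MoeConfig = None
    HAVE_QWEN35_MOE_CONFIG = False
```

Bridges that need the MoE config can then check the flag and raise a clear error (or skip registration) when the installed transformers is too old.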

Summary by CodeRabbit

  • New Features

    • Added Qwen3.5 Vision-Language model support (dense and MoE variants)
    • Added conversion workflows and inference scripts for Qwen3.5-VL models, including SLURM multi-node support
    • Added support for gated attention mechanisms in hybrid architectures
  • Bug Fixes

    • Fixed handling of tuple outputs during model conversion
    • Improved precision consistency during cross-rank broadcasts in model conversion
    • Enhanced parameter comparison tolerance for specific model components

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot bot commented Feb 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cuichenx cuichenx changed the title from "[WIP] [model] Qwen 3.5 VL" to "[WIP] [model] Qwen 3.5 VL Bridge" Feb 26, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx changed the title from "[WIP] [model] Qwen 3.5 VL Bridge" to "[model] Qwen 3.5 VL Bridge" Feb 26, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx changed the title from "[model] Qwen 3.5 VL Bridge" to "[bridge, model] Qwen 3.5 VL Bridge" Feb 27, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as ready for review February 27, 2026 01:22
@cuichenx cuichenx marked this pull request as draft February 27, 2026 01:22
Contributor

coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

Walkthrough

This pull request adds comprehensive support for Qwen3.5 Vision-Language models in both dense and MoE variants. It introduces new bridge and provider implementations, parameter mapping utilities for GDN components, example conversion and inference workflows via bash scripts, and refinements to existing comparison and attention mechanisms.

Changes

Cohort / File(s) Summary
Qwen3.5 VL Bridge & Provider Implementations
src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py, src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
New bridge classes (Qwen35VLBridge, Qwen35VLMoEBridge) with comprehensive parameter mapping registries for dense and MoE architectures; corresponding provider classes (Qwen35VLModelProvider, Qwen35VLMoEModelProvider) for model instantiation with vision encoder and hybrid GDN/gated-attention setup.
Module Exports & VL Model Infrastructure
src/megatron/bridge/models/__init__.py, src/megatron/bridge/models/qwen_vl/__init__.py
Extended public API to export Qwen35VLBridge, Qwen35VLModelProvider, Qwen35VLMoEBridge, and Qwen35VLMoEModelProvider.
Parameter Mapping for GDN Separate Components
src/megatron/bridge/models/conversion/param_mapping.py
Added GDNLinearMappingSeparate class and two helper functions (_fuse_gdn_separate_to_grouped, _split_gdn_grouped_to_separate) to handle separate HF tensor handling for QKV, Z, B, A weight matrices in Qwen3.5 models.
Qwen3 VL Model Attention & Configuration
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py, src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py
Added optional output gate path in attention forward; guarded attention module assignment with submodules check; added trivial init to ExpertMLPDownProjMapping.
Conversion & Roundtrip Comparison Scripts
examples/conversion/compare_hf_and_megatron/compare.py, examples/conversion/hf_megatron_roundtrip_multi_gpu.py
Improved tuple output handling in vlm_forward_step, added MTP disabling workaround, enhanced garbage collection/CUDA cache clearing, explicit dtype normalization for broadcasts, restricted distributed logging visibility; added precision ignore parameters for A_log and linear_attn.norm.weight.
Qwen3.5 Conversion & Inference Workflows
examples/models/vlm/qwen35_vl/conversion.sh, examples/models/vlm/qwen35_vl/inference.sh, examples/models/vlm/qwen35_vl/slurm_inference.sh
New bash scripts orchestrating model conversion/validation (supporting multiple Qwen variants with dtype/model-class selection), multi-mode inference (HF, Megatron, exported HF), and multi-node SLURM-based inference with parallelism setup, containerization support, and environment optimization.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • erhoo82
  • ko3n1g
  • yaoyu-33
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 77.42%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Test Results For Major Changes (⚠️ Warning): the PR description states 'tests and docs updates remain incomplete,' so no test results are documented for these major bridge additions. Resolution: include test execution results from the conversion validation scripts and document numerical correctness verification for both dense and MoE variants.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title '[bridge, model] Qwen 3.5 VL Bridge' clearly summarizes the main change: adding Megatron Bridge support for Qwen 3.5 Vision-Language models with bridge and model provider implementations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (6)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)

87-88: Add documentation clarifying the conditional is intentional for pre-patched specs.

The guard correctly handles two legitimate patterns:

  • Qwen3 VL (standard/MoE) passes specs with .submodules structure from get_gpt_layer_with_transformer_engine_spec(), which get patched here
  • Qwen3.5 VL pre-patches via _patch_standard_attention_specs() before passing TransformerBlockSubmodules (which lacks the expected .submodules structure), so the conditional correctly skips redundant patching

All current call paths work correctly and required patching is not silently skipped. However, add a docstring or inline comment explaining that the hasattr check distinguishes between standard ModuleSpec (needs patching) and pre-patched TransformerBlockSubmodules (skips patching). This prevents future contributors from misinterpreting the conditional as a bug.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py` around lines
87 - 88, Add an inline comment or short docstring next to the conditional that
checks hasattr(language_transformer_layer_spec, "submodules") to explain it is
intentional: ModuleSpecs coming from
get_gpt_layer_with_transformer_engine_spec() contain a .submodules structure and
need the patch that assigns Qwen3VLSelfAttention, whereas
TransformerBlockSubmodules are pre-patched by _patch_standard_attention_specs()
(used by Qwen3.5 VL) and therefore should be skipped; reference both callers
(get_gpt_layer_with_transformer_engine_spec and _patch_standard_attention_specs)
and the Qwen3VLSelfAttention assignment to make the intent clear for future
readers.
examples/models/vlm/qwen35_vl/inference.sh (1)

15-43: Consider adding shell strict mode for safer execution.

Similar to the conversion script, adding set -euo pipefail would help catch failures in the inference pipeline.

♻️ Proposed fix
 # Workspace directory for checkpoints and results
+set -euo pipefail
+
 WORKSPACE=${WORKSPACE:-/workspace}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/vlm/qwen35_vl/inference.sh` around lines 15 - 43, The script
lacks shell strict mode; add a strict-mode line (set -euo pipefail) near the top
of the inference.sh (before WORKSPACE and MODEL_NAME definitions) so the shell
exits on errors, treats unset variables as errors, and fails pipelines properly;
ensure this change is applied to the top of
examples/models/vlm/qwen35_vl/inference.sh so all three uv run blocks benefit
from the stricter error handling.
examples/models/vlm/qwen35_vl/conversion.sh (1)

16-52: Consider adding shell strict mode for safer script execution.

Per Google Shell Style Guide, adding set -euo pipefail at the top helps catch errors early. This prevents the script from silently continuing if a conversion or validation step fails.

♻️ Proposed fix
 # Workspace directory for checkpoints and results
+set -euo pipefail
+
 WORKSPACE=${WORKSPACE:-/workspace}
 MODEL_NAME=Qwen3.5-35B-A3B # Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen3.5-27B
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/vlm/qwen35_vl/conversion.sh` around lines 16 - 52, Add shell
strict mode to the script by enabling "set -euo pipefail" at the top (before the
WORKSPACE/MODEL_NAME/HF_MODEL_CLASS assignments and any uv/python commands) so
the script exits on errors, treats unset variables as failures, and fails on
pipe errors; ensure this is the very first executable line so all subsequent
operations (the uv run convert_checkpoints.py import/export, compare.py
validation, and hf_megatron_roundtrip_multi_gpu.py calls) inherit the stricter
error handling.
examples/models/vlm/qwen35_vl/slurm_inference.sh (1)

146-147: Consider making sync wait time configurable or using a more robust synchronization method.

The hardcoded sleep 5 may be insufficient for large clusters with slow shared filesystems. Consider using a file-based barrier or making the wait time configurable.

♻️ Proposed improvement
+# Sync wait time (adjust for slow shared filesystems)
+SYNC_WAIT_SECONDS=${SYNC_WAIT_SECONDS:-10}
+
 # Only rank 0 on each node runs uv sync
-SYNC_CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 5; fi"
+SYNC_CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep \$SYNC_WAIT_SECONDS; fi"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/vlm/qwen35_vl/slurm_inference.sh` around lines 146 - 147, The
sync logic uses a hardcoded sleep ("SYNC_CMD" and "FULL_CMD") which can fail on
slow clusters; modify SYNC_CMD to use a configurable wait (e.g., read
WAIT_SECONDS from env with a sensible default) or replace the sleep-based gate
with a file-based barrier: have the rank 0 node run "uv sync" then create a
marker file (e.g., /tmp/uv_sync_done_${SLURM_JOB_ID}) and have other ranks poll
for that file with a timeout; update references to SYNC_CMD and FULL_CMD
accordingly and ensure polling/backoff and a configurable timeout are used
instead of fixed "sleep 5".
src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py (1)

218-275: Consider extracting shared param_mappings to reduce duplication.

The param_mappings dictionaries in Qwen35VLMoEBridge.mapping_registry and Qwen35VLBridge.mapping_registry share significant overlap (vision model mappings, embeddings, GDN mappings, etc.). Consider extracting common mappings to a shared constant or helper method.

This would reduce maintenance burden when vision model mappings change. However, keeping them separate also has merit for clarity and allowing variant-specific divergence, so this is optional.

Also applies to: 489-542

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py` around lines 218 -
275, Duplicate param_mappings exist between Qwen35VLMoEBridge.mapping_registry
and Qwen35VLBridge.mapping_registry; extract the shared entries (e.g., vision
model mappings, embeddings, GDN mappings) into a single module-level constant
like COMMON_PARAM_MAPPINGS and then compose each class's mapping_registry by
copying/unpacking COMMON_PARAM_MAPPINGS and applying bridge-specific overrides
(preserving keys used in Qwen35VLMoEBridge.mapping_registry and
Qwen35VLBridge.mapping_registry); update both mapping_registry definitions to
reference the shared constant (keep variant-specific differences after
unpacking) and run tests for correctness (also apply same extraction to the
duplicate block at the other location referenced around lines 489-542).
src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py (1)

35-35: Consider using modern union syntax for type hints.

Per coding guidelines, use T | None instead of Optional[T] for nullable types.

♻️ Proposed fix
-from typing import Any, Callable, List, Optional
+from typing import Any, Callable, List

Then update usages:

-    mtp_num_layers: Optional[int] = None
+    mtp_num_layers: int | None = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py` at line 35, Replace
usages of Optional[...] with the modern union syntax T | None across this module
and remove Optional from the import list; specifically update the import line
that currently includes "Optional" (the same line importing Any, Callable, List,
Optional) to drop Optional and refactor all type annotations that use
Optional[...] to use `| None` (e.g., change `FooOptionalType: Optional[Bar]` to
`FooOptionalType: Bar | None`). Ensure function signatures and class attributes
inside qwen35_vl_provider.py that reference Optional are updated to the new
syntax and that no lingering Optional imports remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/conversion/compare_hf_and_megatron/compare.py`:
- Around line 322-328: The tuple handling in compare.py around model_output is
assuming exactly two elements which can raise if arity differs; change the
branch that handles tuple outputs to always take the first element (e.g.,
output_tensor = model_output[0]) instead of unpacking two values, keeping the
else branch that assigns output_tensor = model_output for non-tuples and
returning output_tensor, loss_func; use the existing symbols model,
forward_args, model_output, output_tensor, and loss_func to locate and update
the logic.
- Around line 619-622: The loop that disables MTP for inference blindly sets
m.config.mtp_num_layers and m.config.grad_scale_func and will crash if those
attributes are absent; update the code in the megatron_model iteration to check
for the attributes (e.g., using hasattr or getattr) on m.config before assigning
and only set them to None when they exist (references: megatron_model,
m.config.mtp_num_layers, m.config.grad_scale_func).

In `@examples/models/vlm/qwen35_vl/inference.sh`:
- Line 17: Replace the hardcoded WORKSPACE assignment with a shell
parameter-expansion default so the script uses the caller's WORKSPACE if set;
specifically change the WORKSPACE variable assignment (symbol WORKSPACE) to use
${WORKSPACE:-/workspace} (and append /models/Qwen as needed) instead of the
fixed /chcui/mbridge_home/models/Qwen path so other users won’t be forced to the
developer’s local location.

In `@examples/models/vlm/qwen35_vl/slurm_inference.sh`:
- Around line 119-128: The GPU validation uses an incorrect formula; replace the
max(TP, EP)×PP logic by computing REQUIRED_GPUS = TP × PP × CP × DP (introduce
CP and DP variables), ensure DP is documented/derived as DP = EP × EDP if
applicable, and update the validation that compares REQUIRED_GPUS to TOTAL_GPUS;
adjust the echo messages to report TP, PP, CP, DP, EP and the relationship
between EP and DP so the check and diagnostic output in the slurm_inference.sh
script reference the new variables (TP, EP, PP, CP, DP, REQUIRED_GPUS,
TOTAL_GPUS).

In `@src/megatron/bridge/models/conversion/param_mapping.py`:
- Around line 1823-1826: The docstring bullet lines for the projection names
(symbols: in_proj_qkv, in_proj_z, in_proj_b, in_proj_a) use EN DASH characters
(–) which triggers RUF002; edit the docstring in
src/megatron/bridge/models/conversion/param_mapping.py and replace each EN DASH
with a standard ASCII hyphen-minus (-) in those bullet lines (and any other
bullets nearby) so the linter no longer flags ambiguous dash characters.
- Around line 2537-2582: The helper _fuse_gdn_separate_to_grouped assumes exact
flat shapes and currently lets torch.split/reshape raise cryptic errors; add
explicit shape validation at the top of _fuse_gdn_separate_to_grouped (and the
symmetric helper referenced around lines 2585-2623, e.g.
_fuse_gdn_grouped_to_separate) that computes qk_dim = config.linear_key_head_dim
* config.linear_num_key_heads, v_dim = config.linear_value_head_dim *
config.linear_num_value_heads and v_per_group = config.linear_num_value_heads //
config.linear_num_key_heads and then checks: qkv.shape == (qk_dim*2 + v_dim,
hidden_size), z.shape == (v_dim, hidden_size), b.shape ==
(config.linear_num_value_heads, hidden_size) or (config.linear_num_key_heads,
v_per_group, hidden_size) as appropriate, and a.shape similarly; if a check
fails raise a ValueError with a clear message naming the parameter (qkv/z/b/a),
the expected shape derived from config (list the dims) and the actual shape so
conversion fails fast and actionable before any torch.reshape/split is
attempted.

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py`:
- Around line 105-111: The flash-decode early-return path skips applying the
attention output gate when self.config.attention_output_gate is true, causing
divergent decode behavior; update the flash-decode branch in the attention
forward code to call get_query_key_value_tensors(..., output_gate=True) when
attention_output_gate is enabled, capture the returned gate, and apply that same
gating logic to the flash-decode output before returning (mirror the gating used
in the non-early-return path so both paths use the gate variable returned by
get_query_key_value_tensors).

In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py`:
- Line 174: Rename the misspelled attribute name hetereogenous_dist_checkpoint
to heterogeneous_dist_checkpoint in qwen35_vl_provider.py and update all
references/usages to that symbol (e.g., any assignments, checks, or constructor
params that use hetereogenous_dist_checkpoint) so the code compiles and keeps
consistent naming across the module.
- Line 338: The variable name hetereogenous_dist_checkpoint is misspelled;
rename the symbol to heterogeneous_dist_checkpoint across the module (definition
and all references) so all assignments, argument names, and uses (e.g., the
variable in qwen35_vl_provider.py and any functions or classes that accept it)
consistently use the corrected spelling and update any related type hints or
docstrings accordingly.
- Around line 420-443: In _patch_standard_attention_specs, the
issubclass(attn_spec.module, SelfAttention) call can raise TypeError when
attn_spec.module is not a class; change the check to safely determine class-ness
before using issubclass (e.g., test isinstance(attn_spec.module, type) and then
issubclass, or wrap issubclass in a try/except TypeError), and ensure you still
only patch modules that both qualify as SelfAttention subclasses and expose the
linear_qkv submodule (use attn_spec.module and attn_spec.submodules.linear_qkv
as guards) so GDN-like modules remain unchanged.
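
The safe class check asked for above can be sketched like this; SelfAttention here is a stand-in for the real Megatron-Core class, and the helper name is illustrative.

```python
from types import SimpleNamespace

class SelfAttention:  # stand-in for megatron.core's SelfAttention
    pass

def is_self_attention_spec(attn_spec, base_cls=SelfAttention):
    # issubclass() raises TypeError when its first argument is not a class,
    # so verify class-ness with isinstance(..., type) before calling it.
    module = getattr(attn_spec, "module", None)
    return isinstance(module, type) and issubclass(module, base_cls)
```

GDN-like specs whose module is a factory function (or missing entirely) then simply return False instead of crashing the patching loop.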


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3e44b4 and a2f99dc.

📒 Files selected for processing (13)
  • examples/conversion/compare_hf_and_megatron/compare.py
  • examples/conversion/hf_megatron_roundtrip_multi_gpu.py
  • examples/models/vlm/qwen35_vl/conversion.sh
  • examples/models/vlm/qwen35_vl/inference.sh
  • examples/models/vlm/qwen35_vl/slurm_inference.sh
  • src/megatron/bridge/models/__init__.py
  • src/megatron/bridge/models/conversion/param_mapping.py
  • src/megatron/bridge/models/qwen_vl/__init__.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
  • src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py

Comment on lines +619 to +622

# Workaround: disable MTP for inference (causes hangs on NCCL collectives)
for m in megatron_model:
    m.config.mtp_num_layers = None
    m.config.grad_scale_func = None

@coderabbitai coderabbitai bot Feb 27, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd /tmp/repo && head -650 examples/conversion/compare_hf_and_megatron/compare.py | tail -50

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 130


🏁 Script executed:

# Get more context around the problematic lines
cd /tmp/repo && sed -n '600,630p' examples/conversion/compare_hf_and_megatron/compare.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 130


🏁 Script executed:

# Search for mtp_num_layers and grad_scale_func definitions and usages
rg -n "mtp_num_layers|grad_scale_func" --type=py src examples

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 4815


🏁 Script executed:

# Check what megatron_model is - search backwards from line 619 to find its assignment
cd /tmp/repo && sed -n '1,625p' examples/conversion/compare_hf_and_megatron/compare.py | grep -n "megatron_model\s*=" | tail -5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 130


🏁 Script executed:

# Look for config class definitions in megatron/core to understand structure
fd "config" src/megatron --type f -name "*.py" | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 301


🏁 Script executed:

sed -n '580,635p' examples/conversion/compare_hf_and_megatron/compare.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2450


🏁 Script executed:

# Check the vlm/base.py file that does the same workaround
sed -n '85,105p' src/megatron/bridge/inference/vlm/base.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 669


🏁 Script executed:

# Search for how megatron_model is populated before line 619
rg -n "megatron_model\s*=|megatron_model\.append|megatron_model\[" examples/conversion/compare_hf_and_megatron/compare.py -A 3

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 469


🏁 Script executed:

# Check if TransformerConfig or other config classes define these fields
rg -n "class.*Config|mtp_num_layers.*=|grad_scale_func.*=" src/megatron/bridge/models -A 2 -B 2 | grep -A 3 -B 3 "class\|mtp_num_layers\|grad_scale_func"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 15938


Guard the MTP workaround with attribute checks.

The code unconditionally assigns to m.config.mtp_num_layers and m.config.grad_scale_func without verifying they exist. These fields are optional across different config types (as shown by Optional[int] definitions in provider configs and defensive getattr() patterns elsewhere in the codebase). Configs without these attributes will fail during this workaround.

Proposed fix
    for m in megatron_model:
-        m.config.mtp_num_layers = None
-        m.config.grad_scale_func = None
+        if hasattr(m.config, "mtp_num_layers"):
+            m.config.mtp_num_layers = None
+        if hasattr(m.config, "grad_scale_func"):
+            m.config.grad_scale_func = None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 # Workaround: disable MTP for inference (causes hangs on NCCL collectives)
 for m in megatron_model:
-    m.config.mtp_num_layers = None
-    m.config.grad_scale_func = None
+    if hasattr(m.config, "mtp_num_layers"):
+        m.config.mtp_num_layers = None
+    if hasattr(m.config, "grad_scale_func"):
+        m.config.grad_scale_func = None

Contributor Author:

Setting them to None has no issues.
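The author's point holds for ordinary Python objects: assigning a previously absent attribute simply creates it. A quick sketch with a stand-in `Config` class (not the real Megatron config):

```python
class Config:
    """Stand-in for m.config; deliberately defines neither attribute."""


cfg = Config()

# Both assignments succeed even though the attributes did not exist before:
cfg.mtp_num_layers = None
cfg.grad_scale_func = None
```

Downstream code reading these attributes with plain access or `getattr` then sees `None`, which is what the workaround relies on.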

Contributor:

@cuichenx, understood! I've noted that setting these config attributes to None works without issues in this context. Thanks for clarifying.


✏️ Learnings added
Learnt from: cuichenx
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2530
File: examples/conversion/compare_hf_and_megatron/compare.py:619-622
Timestamp: 2026-02-27T02:48:46.395Z
Learning: In `examples/conversion/compare_hf_and_megatron/compare.py`, when disabling MTP for inference by setting `m.config.mtp_num_layers = None` and `m.config.grad_scale_func = None` on megatron_model components, no attribute existence checks are needed—setting these to None works without issues even if the attributes don't initially exist.



🧠 Learnings used
Learnt from: HollowMan6
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2387
File: src/megatron/bridge/models/mimo/mimo_bridge.py:49-113
Timestamp: 2026-02-14T16:32:37.992Z
Learning: In `src/megatron/bridge/models/mimo/mimo_bridge.py`, the mapping_registry method intentionally iterates over both "transformer_layer" and "mtp_model_layer" prefixes for MTP layer mappings. This dual-prefix support ensures compatibility because Megatron-Core may expose the same MTP submodule under either naming convention, allowing HF MTP weights to load correctly without requiring configuration overrides.

Learnt from: aroshanghias-nvd
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2040
File: src/megatron/bridge/models/mimo/llava_provider.py:109-113
Timestamp: 2026-01-28T17:39:49.959Z
Learning: In `src/megatron/bridge/models/mimo/llava_provider.py`, the `TransformerConfig` used for the `MultimodalProjector` includes `num_attention_heads=1` as a required placeholder. While the MLP projector doesn't use attention, `TransformerConfig` requires a positive value for `num_attention_heads` (default of 0 is not valid), so using 1 as a placeholder is the correct approach.

Learnt from: aroshanghias-nvd
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2007
File: src/megatron/bridge/data/mimo/loaders.py:0-0
Timestamp: 2026-01-28T20:36:48.499Z
Learning: In MIMO data loaders (`src/megatron/bridge/data/mimo/loaders.py`), use `cfg.train.micro_batch_size` directly instead of calculating `cfg.train.global_batch_size // dp_size`, as the latter ignores gradient accumulation steps. The correct relationship is `global_batch_size = micro_batch_size × dp_size × num_micro_batches`. This matches the standard pattern used in Bridge's non-MIMO data loaders.

Learnt from: CR
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-01-28T19:56:04.048Z
Learning: Applies to src/megatron/bridge/models/**/*.py : Handle tensor parallel and pipeline parallel distribution correctly in weight conversion

Learnt from: yaoyu-33
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2342
File: src/megatron/bridge/models/qwen_omni/thinker_model.py:326-345
Timestamp: 2026-02-13T00:22:54.549Z
Learning: In Python implementations (e.g., src/megatron/bridge/models/qwen_omni/thinker_model.py), when a feature is not supported (such as audio embeddings), raise an explicit error (e.g., NotImplementedError) instead of silently ignoring the input to fail fast with a clear message. Use a descriptive exception message that identifies the unsupported feature and the expected behavior.

Contributor:

Hmm, this breaks MTP in the Mamba model. It looks like GPT and Mamba use different logic to determine whether MTP is needed: for Mamba, MTP layers are still called even when mtp_num_layers=None (here), and the MTP forward pass will then crash (here).
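A hypothetical sketch of the divergence described above (not the real Megatron-Core code): one model gates MTP on the config value, the other on a structural property fixed at build time, so nulling the config disables MTP for the first but not the second.

```python
class GPTLike:
    def __init__(self, mtp_num_layers):
        self.mtp_num_layers = mtp_num_layers

    def forward(self):
        if self.mtp_num_layers:  # config-gated: None cleanly skips MTP
            return list(range(self.mtp_num_layers))
        return []


class MambaLike:
    def __init__(self, mtp_num_layers):
        self.mtp_num_layers = mtp_num_layers
        self.built_with_mtp = True  # decided at construction, not per call

    def forward(self):
        if self.built_with_mtp:  # structural gate: ignores the None config
            return list(range(self.mtp_num_layers))  # TypeError when None
        return []


gpt = GPTLike(mtp_num_layers=None)
gpt_out = gpt.forward()  # [] — MTP disabled as intended

mamba = MambaLike(mtp_num_layers=None)
try:
    mamba.forward()
    crashed = False
except TypeError:  # range(None) raises, mirroring the reported crash
    crashed = True
```

Under this reading, a config-only workaround is safe for GPT-style models but needs a model-type check (or a structural flag) before it can be applied to Mamba.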

Contributor:

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx (Contributor Author):

/ok to test 1a17945

@cuichenx cuichenx marked this pull request as ready for review February 27, 2026 02:45
@cuichenx cuichenx requested a review from a team as a code owner February 27, 2026 02:45
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx merged commit 872c6ec into main Feb 28, 2026
65 of 67 checks passed
@cuichenx cuichenx deleted the chcui/qwen3_5 branch February 28, 2026 00:46
copy-pr-bot bot pushed a commit that referenced this pull request Mar 19, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>
5 participants