Conversation
Signed-off-by: Chen Cui <chcui@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
📝 Walkthrough

This pull request adds comprehensive support for Qwen3.5 Vision-Language models in both dense and MoE variants. It introduces new bridge and provider implementations, parameter mapping utilities for GDN components, example conversion and inference workflows via bash scripts, and refinements to existing comparison and attention mechanisms.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
Actionable comments posted: 10
🧹 Nitpick comments (6)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)
87-88: Add documentation clarifying the conditional is intentional for pre-patched specs.

The guard correctly handles two legitimate patterns:

- Qwen3 VL (standard/MoE) passes specs with a `.submodules` structure from `get_gpt_layer_with_transformer_engine_spec()`, which get patched here
- Qwen3.5 VL pre-patches via `_patch_standard_attention_specs()` before passing `TransformerBlockSubmodules` (which lacks the expected `.submodules` structure), so the conditional correctly skips redundant patching

All current call paths work correctly and required patching is not silently skipped. However, add a docstring or inline comment explaining that the `hasattr` check distinguishes between a standard ModuleSpec (needs patching) and a pre-patched TransformerBlockSubmodules (skips patching). This prevents future contributors from misinterpreting the conditional as a bug.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py` around lines 87-88: add an inline comment or short docstring next to the conditional that checks `hasattr(language_transformer_layer_spec, "submodules")` to explain it is intentional: ModuleSpecs coming from `get_gpt_layer_with_transformer_engine_spec()` contain a `.submodules` structure and need the patch that assigns `Qwen3VLSelfAttention`, whereas `TransformerBlockSubmodules` are pre-patched by `_patch_standard_attention_specs()` (used by Qwen3.5 VL) and therefore should be skipped; reference both callers (`get_gpt_layer_with_transformer_engine_spec` and `_patch_standard_attention_specs`) and the `Qwen3VLSelfAttention` assignment to make the intent clear for future readers.

examples/models/vlm/qwen35_vl/inference.sh (1)
15-43: Consider adding shell strict mode for safer execution.

Similar to the conversion script, adding `set -euo pipefail` would help catch failures in the inference pipeline.

♻️ Proposed fix

```diff
 # Workspace directory for checkpoints and results
+set -euo pipefail
+
 WORKSPACE=${WORKSPACE:-/workspace}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/vlm/qwen35_vl/inference.sh` around lines 15-43: the script lacks shell strict mode; add a strict-mode line (`set -euo pipefail`) near the top of inference.sh (before the WORKSPACE and MODEL_NAME definitions) so the shell exits on errors, treats unset variables as errors, and fails pipelines properly; ensure this change is applied to the top of examples/models/vlm/qwen35_vl/inference.sh so all three `uv run` blocks benefit from the stricter error handling.

examples/models/vlm/qwen35_vl/conversion.sh (1)
16-52: Consider adding shell strict mode for safer script execution.

Per the Google Shell Style Guide, adding `set -euo pipefail` at the top helps catch errors early. This prevents the script from silently continuing if a conversion or validation step fails.

♻️ Proposed fix

```diff
 # Workspace directory for checkpoints and results
+set -euo pipefail
+
 WORKSPACE=${WORKSPACE:-/workspace}
 MODEL_NAME=Qwen3.5-35B-A3B # Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen3.5-27B
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/vlm/qwen35_vl/conversion.sh` around lines 16-52: add shell strict mode by enabling `set -euo pipefail` at the top (before the WORKSPACE/MODEL_NAME/HF_MODEL_CLASS assignments and any uv/python commands) so the script exits on errors, treats unset variables as failures, and fails on pipe errors; ensure this is the very first executable line so all subsequent operations (the `uv run convert_checkpoints.py` import/export, compare.py validation, and hf_megatron_roundtrip_multi_gpu.py calls) inherit the stricter error handling.

examples/models/vlm/qwen35_vl/slurm_inference.sh (1)
146-147: Consider making the sync wait time configurable or using a more robust synchronization method.

The hardcoded `sleep 5` may be insufficient for large clusters with slow shared filesystems. Consider using a file-based barrier or making the wait time configurable.

♻️ Proposed improvement

```diff
+# Sync wait time (adjust for slow shared filesystems)
+SYNC_WAIT_SECONDS=${SYNC_WAIT_SECONDS:-10}
+
 # Only rank 0 on each node runs uv sync
-SYNC_CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 5; fi"
+SYNC_CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep \$SYNC_WAIT_SECONDS; fi"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/vlm/qwen35_vl/slurm_inference.sh` around lines 146-147: the sync logic uses a hardcoded sleep (in SYNC_CMD and FULL_CMD) which can fail on slow clusters; modify SYNC_CMD to use a configurable wait (e.g., read WAIT_SECONDS from env with a sensible default) or replace the sleep-based gate with a file-based barrier: have rank 0 on each node run `uv sync` then create a marker file (e.g., `/tmp/uv_sync_done_${SLURM_JOB_ID}`) and have other ranks poll for that file with a timeout; update references to SYNC_CMD and FULL_CMD accordingly and ensure polling/backoff and a configurable timeout are used instead of a fixed `sleep 5`.

src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py (1)
218-275: Consider extracting shared param_mappings to reduce duplication.

The `param_mappings` dictionaries in `Qwen35VLMoEBridge.mapping_registry` and `Qwen35VLBridge.mapping_registry` share significant overlap (vision model mappings, embeddings, GDN mappings, etc.). Consider extracting common mappings to a shared constant or helper method.

This would reduce maintenance burden when vision model mappings change. However, keeping them separate also has merit for clarity and for allowing variant-specific divergence, so this is optional.
Also applies to: 489-542
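A sketch of the suggested extraction; `COMMON_PARAM_MAPPINGS` and the mapping keys below are illustrative placeholders rather than the bridges' real entries:

```python
# Hypothetical sketch: compose per-variant mapping dicts from one shared constant.
# All names and keys here are placeholders, not the actual bridge mappings.
COMMON_PARAM_MAPPINGS = {
    "vision_model.patch_embed.weight": "visual.patch_embed.proj.weight",
    "language_model.embedding.word_embeddings.weight": "model.embed_tokens.weight",
}


def dense_param_mappings() -> dict[str, str]:
    # Shared entries first, then dense-specific additions/overrides.
    return {**COMMON_PARAM_MAPPINGS, "decoder.mlp.linear_fc1.weight": "model.mlp.gate_up_proj.weight"}


def moe_param_mappings() -> dict[str, str]:
    # Same shared entries, MoE-specific additions.
    return {**COMMON_PARAM_MAPPINGS, "decoder.mlp.experts.weight": "model.mlp.experts.weight"}
```

When a shared vision-model mapping changes, only `COMMON_PARAM_MAPPINGS` needs updating; variant-specific keys stay local to each function.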
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py` around lines 218 - 275, Duplicate param_mappings exist between Qwen35VLMoEBridge.mapping_registry and Qwen35VLBridge.mapping_registry; extract the shared entries (e.g., vision model mappings, embeddings, GDN mappings) into a single module-level constant like COMMON_PARAM_MAPPINGS and then compose each class's mapping_registry by copying/unpacking COMMON_PARAM_MAPPINGS and applying bridge-specific overrides (preserving keys used in Qwen35VLMoEBridge.mapping_registry and Qwen35VLBridge.mapping_registry); update both mapping_registry definitions to reference the shared constant (keep variant-specific differences after unpacking) and run tests for correctness (also apply same extraction to the duplicate block at the other location referenced around lines 489-542).src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py (1)
35-35: Consider using modern union syntax for type hints.

Per coding guidelines, use `T | None` instead of `Optional[T]` for nullable types.

♻️ Proposed fix

```diff
-from typing import Any, Callable, List, Optional
+from typing import Any, Callable, List
```

Then update usages:

```diff
-    mtp_num_layers: Optional[int] = None
+    mtp_num_layers: int | None = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py` at line 35: replace usages of `Optional[...]` with the modern union syntax `T | None` across this module and remove `Optional` from the import list; specifically update the import line that currently includes `Optional` (the same line importing Any, Callable, List, Optional) to drop `Optional`, and refactor all type annotations that use `Optional[...]` to use `| None`. Ensure function signatures and class attributes inside qwen35_vl_provider.py that reference `Optional` are updated to the new syntax and that no lingering `Optional` imports remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/conversion/compare_hf_and_megatron/compare.py`:
- Around line 322-328: The tuple handling in compare.py around model_output is
assuming exactly two elements which can raise if arity differs; change the
branch that handles tuple outputs to always take the first element (e.g.,
output_tensor = model_output[0]) instead of unpacking two values, keeping the
else branch that assigns output_tensor = model_output for non-tuples and
returning output_tensor, loss_func; use the existing symbols model,
forward_args, model_output, output_tensor, and loss_func to locate and update
the logic.
- Around line 619-622: The loop that disables MTP for inference blindly sets
m.config.mtp_num_layers and m.config.grad_scale_func and will crash if those
attributes are absent; update the code in the megatron_model iteration to check
for the attributes (e.g., using hasattr or getattr) on m.config before assigning
and only set them to None when they exist (references: megatron_model,
m.config.mtp_num_layers, m.config.grad_scale_func).
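The tuple-handling fix in the first comment above can be sketched as follows; `extract_output` is a hypothetical helper name, not a function in `compare.py`:

```python
# Hypothetical sketch: take the first element of a tuple-valued model output
# instead of unpacking exactly two values, so a change in arity cannot raise.
def extract_output(model_output):
    if isinstance(model_output, tuple):
        # Works for (output,), (output, loss_func), or longer tuples alike.
        return model_output[0]
    # Non-tuple outputs are returned unchanged.
    return model_output
```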
In `@examples/models/vlm/qwen35_vl/inference.sh`:
- Line 17: Replace the hardcoded WORKSPACE assignment with a shell
parameter-expansion default so the script uses the caller's WORKSPACE if set;
specifically change the WORKSPACE variable assignment (symbol WORKSPACE) to use
${WORKSPACE:-/workspace} (and append /models/Qwen as needed) instead of the
fixed /chcui/mbridge_home/models/Qwen path so other users won’t be forced to the
developer’s local location.
In `@examples/models/vlm/qwen35_vl/slurm_inference.sh`:
- Around line 119-128: The GPU validation uses an incorrect formula; replace the
max(TP, EP)×PP logic by computing REQUIRED_GPUS = TP × PP × CP × DP (introduce
CP and DP variables), ensure DP is documented/derived as DP = EP × EDP if
applicable, and update the validation that compares REQUIRED_GPUS to TOTAL_GPUS;
adjust the echo messages to report TP, PP, CP, DP, EP and the relationship
between EP and DP so the check and diagnostic output in the slurm_inference.sh
script reference the new variables (TP, EP, PP, CP, DP, REQUIRED_GPUS,
TOTAL_GPUS).
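The corrected requirement formula could look like the following sketch; the variable defaults and the mismatch message are illustrative, and `DP` would be derived as `EP × EDP` where expert parallelism applies:

```shell
# Hypothetical sketch of the corrected GPU validation; defaults are illustrative.
TP=${TP:-2}; PP=${PP:-2}; CP=${CP:-1}; DP=${DP:-4}
TOTAL_GPUS=${TOTAL_GPUS:-16}

# Required world size is the product of all parallelism dimensions.
REQUIRED_GPUS=$((TP * PP * CP * DP))

if [ "$TOTAL_GPUS" -ne "$REQUIRED_GPUS" ]; then
  echo "Mismatch: TP=$TP PP=$PP CP=$CP DP=$DP requires $REQUIRED_GPUS GPUs, but $TOTAL_GPUS are allocated" >&2
  exit 1
fi
```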
In `@src/megatron/bridge/models/conversion/param_mapping.py`:
- Around line 1823-1826: The docstring bullet lines for the projection names
(symbols: in_proj_qkv, in_proj_z, in_proj_b, in_proj_a) use EN DASH characters
(–) which triggers RUF002; edit the docstring in
src/megatron/bridge/models/conversion/param_mapping.py and replace each EN DASH
with a standard ASCII hyphen-minus (-) in those bullet lines (and any other
bullets nearby) so the linter no longer flags ambiguous dash characters.
- Around line 2537-2582: The helper _fuse_gdn_separate_to_grouped assumes exact
flat shapes and currently lets torch.split/reshape raise cryptic errors; add
explicit shape validation at the top of _fuse_gdn_separate_to_grouped (and the
symmetric helper referenced around lines 2585-2623, e.g.
_fuse_gdn_grouped_to_separate) that computes qk_dim = config.linear_key_head_dim
* config.linear_num_key_heads, v_dim = config.linear_value_head_dim *
config.linear_num_value_heads and v_per_group = config.linear_num_value_heads //
config.linear_num_key_heads and then checks: qkv.shape == (qk_dim*2 + v_dim,
hidden_size), z.shape == (v_dim, hidden_size), b.shape ==
(config.linear_num_value_heads, hidden_size) or (config.linear_num_key_heads,
v_per_group, hidden_size) as appropriate, and a.shape similarly; if a check
fails raise a ValueError with a clear message naming the parameter (qkv/z/b/a),
the expected shape derived from config (list the dims) and the actual shape so
conversion fails fast and actionable before any torch.reshape/split is
attempted.
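The fail-fast validation described above might look like this sketch; the dict-based config and concrete dimensions are illustrative stand-ins for the real transformer config object:

```python
# Hypothetical sketch of fail-fast shape checks before splitting a fused GDN tensor.
# The config keys mirror the fields named in the review; values are illustrative.
def validate_gdn_shapes(qkv_shape, z_shape, hidden_size, config):
    qk_dim = config["linear_key_head_dim"] * config["linear_num_key_heads"]
    v_dim = config["linear_value_head_dim"] * config["linear_num_value_heads"]

    expected_qkv = (qk_dim * 2 + v_dim, hidden_size)
    if tuple(qkv_shape) != expected_qkv:
        raise ValueError(f"qkv: expected shape {expected_qkv}, got {tuple(qkv_shape)}")

    expected_z = (v_dim, hidden_size)
    if tuple(z_shape) != expected_z:
        raise ValueError(f"z: expected shape {expected_z}, got {tuple(z_shape)}")
```

Raising before any reshape/split turns a cryptic tensor-op error into an actionable message naming the offending parameter and both shapes.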
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py`:
- Around line 105-111: The flash-decode early-return path skips applying the
attention output gate when self.config.attention_output_gate is true, causing
divergent decode behavior; update the flash-decode branch in the attention
forward code to call get_query_key_value_tensors(..., output_gate=True) when
attention_output_gate is enabled, capture the returned gate, and apply that same
gating logic to the flash-decode output before returning (mirror the gating used
in the non-early-return path so both paths use the gate variable returned by
get_query_key_value_tensors).
In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py`:
- Line 174: Rename the misspelled attribute name hetereogenous_dist_checkpoint
to heterogeneous_dist_checkpoint in qwen35_vl_provider.py and update all
references/usages to that symbol (e.g., any assignments, checks, or constructor
params that use hetereogenous_dist_checkpoint) so the code compiles and keeps
consistent naming across the module.
- Line 338: The variable name hetereogenous_dist_checkpoint is misspelled;
rename the symbol to heterogeneous_dist_checkpoint across the module (definition
and all references) so all assignments, argument names, and uses (e.g., the
variable in qwen35_vl_provider.py and any functions or classes that accept it)
consistently use the corrected spelling and update any related type hints or
docstrings accordingly.
- Around line 420-443: In _patch_standard_attention_specs, the
issubclass(attn_spec.module, SelfAttention) call can raise TypeError when
attn_spec.module is not a class; change the check to safely determine class-ness
before using issubclass (e.g., test isinstance(attn_spec.module, type) and then
issubclass, or wrap issubclass in a try/except TypeError), and ensure you still
only patch modules that both qualify as SelfAttention subclasses and expose the
linear_qkv submodule (use attn_spec.module and attn_spec.submodules.linear_qkv
as guards) so GDN-like modules remain unchanged.
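The safe class check can be sketched as follows; `should_patch` and the bare `SelfAttention` stub are illustrative, not the module's real API:

```python
# Hypothetical sketch: guard issubclass so non-class spec modules (e.g. factory
# callables used by GDN-like layers) are skipped instead of raising TypeError.
class SelfAttention:  # stand-in for megatron.core's SelfAttention
    pass


def should_patch(module) -> bool:
    # issubclass raises TypeError for non-classes, so check class-ness first.
    return isinstance(module, type) and issubclass(module, SelfAttention)
```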
---
Nitpick comments:
In `@examples/models/vlm/qwen35_vl/conversion.sh`:
- Around line 16-52: Add shell strict mode to the script by enabling "set -euo
pipefail" at the top (before the WORKSPACE/MODEL_NAME/HF_MODEL_CLASS assignments
and any uv/python commands) so the script exits on errors, treats unset
variables as failures, and fails on pipe errors; ensure this is the very first
executable line so all subsequent operations (the uv run convert_checkpoints.py
import/export, compare.py validation, and hf_megatron_roundtrip_multi_gpu.py
calls) inherit the stricter error handling.
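As a minimal sketch of the strict-mode pattern both script nitpicks suggest (the workspace default and echo are illustrative, not the scripts' real contents):

```shell
# Hypothetical sketch: strict mode plus an overridable workspace default.
set -euo pipefail   # exit on error, error on unset variables, fail broken pipes

# ${VAR:-default} keeps a caller-supplied WORKSPACE instead of hardcoding a path.
WORKSPACE=${WORKSPACE:-/workspace}

echo "using workspace: $WORKSPACE"
```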
In `@examples/models/vlm/qwen35_vl/inference.sh`:
- Around line 15-43: The script lacks shell strict mode; add a strict-mode line
(set -euo pipefail) near the top of the inference.sh (before WORKSPACE and
MODEL_NAME definitions) so the shell exits on errors, treats unset variables as
errors, and fails pipelines properly; ensure this change is applied to the top
of examples/models/vlm/qwen35_vl/inference.sh so all three uv run blocks benefit
from the stricter error handling.
In `@examples/models/vlm/qwen35_vl/slurm_inference.sh`:
- Around line 146-147: The sync logic uses a hardcoded sleep ("SYNC_CMD" and
"FULL_CMD") which can fail on slow clusters; modify SYNC_CMD to use a
configurable wait (e.g., read WAIT_SECONDS from env with a sensible default) or
replace the sleep-based gate with a file-based barrier: have the rank 0 node run
"uv sync" then create a marker file (e.g., /tmp/uv_sync_done_${SLURM_JOB_ID}")
and have other ranks poll for that file with a timeout; update references to
SYNC_CMD and FULL_CMD accordingly and ensure polling/backoff and a configurable
timeout are used instead of fixed "sleep 5".
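A sketch of the file-based barrier described above; the marker path, timeout variable, and the no-op standing in for `uv sync` are assumptions, not the script's real contents:

```shell
# Hypothetical sketch of a file-based barrier replacing a fixed sleep.
MARKER="/tmp/uv_sync_done_$$"
SYNC_TIMEOUT=${SYNC_TIMEOUT:-120}

if [ "${SLURM_LOCALID:-0}" -eq 0 ]; then
  :                  # stand-in for the real `uv sync`
  touch "$MARKER"    # signal completion to the other local ranks
else
  waited=0
  until [ -f "$MARKER" ]; do
    sleep 1
    waited=$((waited + 1))
    if [ "$waited" -ge "$SYNC_TIMEOUT" ]; then
      echo "timed out waiting for uv sync" >&2
      exit 1
    fi
  done
fi
```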
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py`:
- Around line 87-88: Add an inline comment or short docstring next to the
conditional that checks hasattr(language_transformer_layer_spec, "submodules")
to explain it is intentional: ModuleSpecs coming from
get_gpt_layer_with_transformer_engine_spec() contain a .submodules structure and
need the patch that assigns Qwen3VLSelfAttention, whereas
TransformerBlockSubmodules are pre-patched by _patch_standard_attention_specs()
(used by Qwen3.5 VL) and therefore should be skipped; reference both callers
(get_gpt_layer_with_transformer_engine_spec and _patch_standard_attention_specs)
and the Qwen3VLSelfAttention assignment to make the intent clear for future
readers.
In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py`:
- Around line 218-275: Duplicate param_mappings exist between
Qwen35VLMoEBridge.mapping_registry and Qwen35VLBridge.mapping_registry; extract
the shared entries (e.g., vision model mappings, embeddings, GDN mappings) into
a single module-level constant like COMMON_PARAM_MAPPINGS and then compose each
class's mapping_registry by copying/unpacking COMMON_PARAM_MAPPINGS and applying
bridge-specific overrides (preserving keys used in
Qwen35VLMoEBridge.mapping_registry and Qwen35VLBridge.mapping_registry); update
both mapping_registry definitions to reference the shared constant (keep
variant-specific differences after unpacking) and run tests for correctness
(also apply same extraction to the duplicate block at the other location
referenced around lines 489-542).
In `@src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py`:
- Line 35: Replace usages of Optional[...] with the modern union syntax T | None
across this module and remove Optional from the import list; specifically update
the import line that currently includes "Optional" (the same line importing Any,
Callable, List, Optional) to drop Optional and refactor all type annotations
that use Optional[...] to use `| None` (e.g., change `FooOptionalType:
Optional[Bar]` to `FooOptionalType: Bar | None`). Ensure function signatures and
class attributes inside qwen35_vl_provider.py that reference Optional are
updated to the new syntax and that no lingering Optional imports remain.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)

- examples/conversion/compare_hf_and_megatron/compare.py
- examples/conversion/hf_megatron_roundtrip_multi_gpu.py
- examples/models/vlm/qwen35_vl/conversion.sh
- examples/models/vlm/qwen35_vl/inference.sh
- examples/models/vlm/qwen35_vl/slurm_inference.sh
- src/megatron/bridge/models/__init__.py
- src/megatron/bridge/models/conversion/param_mapping.py
- src/megatron/bridge/models/qwen_vl/__init__.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/attention.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
- src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py
- src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
- src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py
```python
# Workaround: disable MTP for inference (causes hangs on NCCL collectives)
for m in megatron_model:
    m.config.mtp_num_layers = None
    m.config.grad_scale_func = None
```
Guard the MTP workaround with attribute checks.
The code unconditionally assigns to m.config.mtp_num_layers and m.config.grad_scale_func without verifying they exist. These fields are optional across different config types (as shown by Optional[int] definitions in provider configs and defensive getattr() patterns elsewhere in the codebase). Configs without these attributes will fail during this workaround.
Proposed fix

```diff
 for m in megatron_model:
-    m.config.mtp_num_layers = None
-    m.config.grad_scale_func = None
+    if hasattr(m.config, "mtp_num_layers"):
+        m.config.mtp_num_layers = None
+    if hasattr(m.config, "grad_scale_func"):
+        m.config.grad_scale_func = None
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
# Workaround: disable MTP for inference (causes hangs on NCCL collectives)
for m in megatron_model:
    if hasattr(m.config, "mtp_num_layers"):
        m.config.mtp_num_layers = None
    if hasattr(m.config, "grad_scale_func"):
        m.config.grad_scale_func = None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/conversion/compare_hf_and_megatron/compare.py` around lines 619 -
622, The loop that disables MTP for inference blindly sets
m.config.mtp_num_layers and m.config.grad_scale_func and will crash if those
attributes are absent; update the code in the megatron_model iteration to check
for the attributes (e.g., using hasattr or getattr) on m.config before assigning
and only set them to None when they exist (references: megatron_model,
m.config.mtp_num_layers, m.config.grad_scale_func).
setting them as none has no issues
@cuichenx, understood! I've noted that setting these config attributes to None works without issues in this context. Thanks for clarifying.
✏️ Learnings added
Learnt from: cuichenx
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2530
File: examples/conversion/compare_hf_and_megatron/compare.py:619-622
Timestamp: 2026-02-27T02:48:46.395Z
Learning: In `examples/conversion/compare_hf_and_megatron/compare.py`, when disabling MTP for inference by setting `m.config.mtp_num_layers = None` and `m.config.grad_scale_func = None` on megatron_model components, no attribute existence checks are needed—setting these to None works without issues even if the attributes don't initially exist.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: HollowMan6
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2387
File: src/megatron/bridge/models/mimo/mimo_bridge.py:49-113
Timestamp: 2026-02-14T16:32:37.992Z
Learning: In `src/megatron/bridge/models/mimo/mimo_bridge.py`, the mapping_registry method intentionally iterates over both "transformer_layer" and "mtp_model_layer" prefixes for MTP layer mappings. This dual-prefix support ensures compatibility because Megatron-Core may expose the same MTP submodule under either naming convention, allowing HF MTP weights to load correctly without requiring configuration overrides.
Learnt from: aroshanghias-nvd
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2040
File: src/megatron/bridge/models/mimo/llava_provider.py:109-113
Timestamp: 2026-01-28T17:39:49.959Z
Learning: In `src/megatron/bridge/models/mimo/llava_provider.py`, the `TransformerConfig` used for the `MultimodalProjector` includes `num_attention_heads=1` as a required placeholder. While the MLP projector doesn't use attention, `TransformerConfig` requires a positive value for `num_attention_heads` (default of 0 is not valid), so using 1 as a placeholder is the correct approach.
Learnt from: aroshanghias-nvd
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2007
File: src/megatron/bridge/data/mimo/loaders.py:0-0
Timestamp: 2026-01-28T20:36:48.499Z
Learning: In MIMO data loaders (`src/megatron/bridge/data/mimo/loaders.py`), use `cfg.train.micro_batch_size` directly instead of calculating `cfg.train.global_batch_size // dp_size`, as the latter ignores gradient accumulation steps. The correct relationship is `global_batch_size = micro_batch_size × dp_size × num_micro_batches`. This matches the standard pattern used in Bridge's non-MIMO data loaders.
Learnt from: CR
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-01-28T19:56:04.048Z
Learning: Applies to src/megatron/bridge/models/**/*.py : Handle tensor parallel and pipeline parallel distribution correctly in weight conversion
Learnt from: yaoyu-33
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2342
File: src/megatron/bridge/models/qwen_omni/thinker_model.py:326-345
Timestamp: 2026-02-13T00:22:54.549Z
Learning: In Python implementations (e.g., src/megatron/bridge/models/qwen_omni/thinker_model.py), when a feature is not supported (such as audio embeddings), raise an explicit error (e.g., NotImplementedError) instead of silently ignoring the input to fail fast with a clear message. Use a descriptive exception message that identifies the unsupported feature and the expected behavior.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
/ok to test 1a17945
What does this PR do?
Add Megatron Bridge support for Qwen3.5 Vision-Language models, enabling bidirectional HF ↔ Megatron checkpoint conversion for both the dense and MoE variants.
Changelog
- `Qwen35VLBridge` (dense, e.g. Qwen3.5-27B) and `Qwen35VLMoEBridge` (MoE, e.g. Qwen3.5-397B-A17B) with full weight mapping registries
- `Qwen35VLModelProvider` and `Qwen35VLMoEModelProvider` for Megatron-Core model instantiation with the hybrid Gated DeltaNet + Gated Attention architecture
- `GDNLinearMappingSeparate` param mapping to handle Qwen3.5's separate QKV/Z/B/A projection layout (vs. Qwen3-Next's fused layout), plus `_fuse_gdn_separate_to_grouped`/`_split_gdn_grouped_to_separate` helpers
- `compare.py` roundtrip comparison script

GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
- Follows the existing Qwen VL bridge pattern (`Qwen25VLBridge`, `Qwen3VLBridge`, `Qwen3VLMoEBridge`), extended to the new Qwen3.5 generation
- Requires `transformers >= 5.2.0` for the MoE config class; the dense config is guarded behind a separate import check

Summary by CodeRabbit
New Features
Bug Fixes