
[OMNIML2914] Support Nemotron-3-Nano PTQ, TE spec migration, and VLM quantization (Qwen3-VL)#1742

Merged
yueshen2016 merged 4 commits into main from yueshen/PTQ-support-Qwen3-VL
Feb 26, 2026

Conversation

@yueshen2016
Contributor

@yueshen2016 yueshen2016 commented Dec 16, 2025

What does this PR do?

This PR adds support for Post-Training Quantization (PTQ) and quantized-checkpoint resume for large language models and vision-language models (VLMs) using Megatron-Bridge, along with a migration from the local spec to the TE (Transformer Engine) spec for ModelOpt quantization.

Specifically:

  1. Support PTQ and resume of quantized checkpoint for NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  2. Change from local spec to TE spec, and deprecate local spec support for ModelOpt quantization
  3. Support VLM model PTQ with image as calibration data, using Qwen3-VL-30B-A3B-Instruct and Qwen3-VL-8B-Instruct as examples

Changelog

  • Added PTQ support for NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with multi-GPU parallelism (TP/PP/EP)
  • Added quantized checkpoint resume and generation for Nemotron-3-Nano
  • Migrated quantization layer support from local spec to TE spec; deprecated local spec for ModelOpt quantization
  • Added VLM quantization support with image calibration data (detection-datasets/coco)
  • Added quantization example scripts for VLM workflows (quantize_vlm.py, ptq_generate_vlm.py) with configurable parallelism
  • Added comprehensive Qwen3 VL quantization end-to-end test suite with multiple parallelism configurations
  • Fixed multi-dimensional attention mask handling

Usage Examples

NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

PTQ

torchrun --nproc_per_node 8 examples/quantization/quantize.py \
  --hf-model-id /models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --calib-size 4 \
  --export-quant-cfg nvfp4 \
  --megatron-save-path /models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4-MLM \
  --pp 4 \
  --tp 2 \
  --ep 2 \
  --trust-remote-code

Resume Quantized Checkpoint

torchrun --nproc_per_node 8 examples/quantization/ptq_generate.py \
  --megatron-load-path /models/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4-MLM \
  --hf-model-id /models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --trust-remote-code \
  --tp 8 \
  --ep 8

Qwen3-VL-30B-A3B-Instruct

PTQ

torchrun --nproc_per_node=8 examples/quantization/quantize_vlm.py \
    --hf-model-id /models/Qwen3-VL-30B-A3B-Instruct \
    --export-quant-cfg fp8 \
    --megatron-save-path /models/Qwen3-VL-30B-A3B-Instruct_fp8_mlm \
    --tp 4 \
    --etp 4 \
    --pp 2 \
    --calib-size 32

Generate

torchrun --nproc_per_node=8 examples/quantization/ptq_generate_vlm.py \
    --hf-model-id /models/Qwen3-VL-30B-A3B-Instruct \
    --megatron-load-path /models/Qwen3-VL-30B-A3B-Instruct_fp8_mlm \
    --tp 8 \
    --ep 8 \
    --image-path /models/demo.jpeg \
    --prompts "Describe this image."

Qwen3-VL-8B-Instruct

PTQ

torchrun --nproc_per_node=8 examples/quantization/quantize_vlm.py \
    --hf-model-id /models/Qwen3-VL-8B-Instruct \
    --export-quant-cfg fp8 \
    --megatron-save-path /models/Qwen3-VL-8B-Instruct_fp8_mlm \
    --tp 4 \
    --pp 2 \
    --calib-size 8

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.


@yueshen2016 yueshen2016 requested review from yashaswikarnati and removed request for yashaswikarnati December 16, 2025 10:48
@yueshen2016 yueshen2016 force-pushed the yueshen/PTQ-support-Qwen3-VL branch from f7453ef to 3526cf1 on December 16, 2025 10:51
@coderabbitai
Contributor

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Walkthrough

This PR introduces quantization support for Vision-Language Models (VLMs), particularly Qwen3-VL, by adding new quantization utilities, VLM-specific quantization and generation scripts, updating the Qwen3-VL model bridge with additional parameter mappings, and adding comprehensive functional test coverage for the new workflows.

Changes

Cohort / File(s) Summary
Quantization Utilities & Infrastructure
.github/workflows/cicd-main.yml, examples/quantization/quantize_utils.py, src/megatron/bridge/models/gpt_provider.py
Added new test script to CI pipeline; introduced centralized quantization config utilities module with configuration choices, table creation, and CLI argument helpers; added TE-spec support detection for specific models (Qwen3-8B) with conditional layer-spec selection logic and modelopt_use_te flag to GPTModelProvider.
VLM Quantization Scripts
examples/quantization/quantize_vlm.py, examples/quantization/ptq_generate_vlm.py
Introduced two new VLM quantization scripts: quantize_vlm.py for offline quantization with COCO/random calibration pipelines and checkpointing, and ptq_generate_vlm.py for loading and generating from quantized VLM checkpoints across multiple GPUs.
PTQ Script Updates
examples/quantization/ptq_generate.py, examples/quantization/quantize.py
Refactored to support dual-path quantization validation (local-spec and TE-spec layers); quantize.py now uses centralized quantize_utils and includes dynamic layer-spec selection based on TE support.
Qwen3-VL Model & Bridge
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py, src/megatron/bridge/models/qwen_vl/qwen3_vl_bridge.py
Extended forward signature with inference/gathering context parameters; normalized multi-dimensional attention masks to 2D; added layernorm parameter mappings for both standard and MoE variants.
Model Loading & Training
src/megatron/bridge/training/model_load_save.py
Enhanced checkpoint loading with modelopt state detection and debug logging to determine TE-spec usage during model restoration.
Test Infrastructure & Cases
tests/functional_tests/L2_Launch_models_qwen_vl_quantization.sh, tests/functional_tests/quantization/models/qwen_vl/__init__.py, tests/functional_tests/quantization/models/qwen_vl/test_qwen3_vl_quantization_workflow.py
Added new test runner script for VLM quantization with coverage collection; added comprehensive test class covering quantization workflow, generation from quantized checkpoints, and parallelism validation.
Test Cleanup
tests/functional_tests/quantization/models/qwen/test_qwen3_moe_quantization_workflow.py
Removed debug logging block that printed parameter dtype before saving.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant Main as Main Entry<br/>(quantize_vlm)
    participant Bridge as AutoBridge
    participant Processor as AutoProcessor
    participant MegatronMgr as Megatron<br/>ModelProvider
    participant Quantizer as ModelOpt<br/>Quantizer
    participant SaveMgr as Checkpoint<br/>Manager

    User->>Main: Call with HF model ID,<br/>parallelism config
    Main->>Bridge: Load HF VLM
    Bridge-->>Main: Wrapped model instance
    Main->>Processor: Load text/image processor
    Processor-->>Main: Processor instance
    Main->>MegatronMgr: Configure TP/PP/EP/ETP
    Main->>MegatronMgr: Initialize Megatron model
    MegatronMgr-->>Main: Initialized model
    Main->>Main: Select calibration data<br/>(COCO or random)
    Main->>Quantizer: Run quantization with<br/>forward loop
    Quantizer->>Quantizer: Apply PTQ passes
    Quantizer-->>Main: Quantized model
    Main->>SaveMgr: Optionally compress weights
    Main->>SaveMgr: Save quantized checkpoint
    SaveMgr-->>Main: Save complete
    Main->>Main: Run test prompt/image<br/>forward pass
    Main-->>User: Generation output & stats
sequenceDiagram
    participant User as User/CLI
    participant Main as Main Entry<br/>(ptq_generate_vlm)
    participant Bridge as AutoBridge
    participant Processor as AutoProcessor
    participant MegatronMgr as Megatron<br/>ModelProvider
    participant ChkptLoader as Checkpoint<br/>Loader
    participant Validator as Quantization<br/>Validator
    participant Generator as Generation<br/>Loop

    User->>Main: Call with quantized<br/>checkpoint path
    Main->>Main: Validate paths &<br/>environment
    Main->>Bridge: Load HF model
    Bridge-->>Main: Model instance
    Main->>Processor: Load processor
    Processor-->>Main: Processor instance
    Main->>MegatronMgr: Configure parallelism
    Main->>ChkptLoader: Load quantized checkpoint
    ChkptLoader-->>Main: Loaded state dict
    Main->>Main: Apply to Megatron model
    Main->>Validator: Validate quantized layers<br/>(TE-spec layers present)
    Validator-->>Main: Validation passed/failed
    alt Validation Success
        Main->>Generator: Run generation loop<br/>with image & prompts
        Generator-->>Main: Generation outputs
        Main-->>User: Output messages & results
    else Validation Failed
        Main-->>User: Error: Missing quantized layers
    end
    Main->>Main: Cleanup distributed<br/>process group

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • yaoyu-33
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes (⚠️ Warning): The PR introduces major PTQ support for VLMs but lacks test results and performance benchmarks in the description. Resolution: update the PR description with test execution results, validation outcomes, numerical correctness evidence, and performance metrics across the different parallelism configurations.
✅ Passed checks (3 passed)
  • Docstring Coverage (✅ Passed): Docstring coverage is 96.00%, which is sufficient; the required threshold is 80.00%.
  • Title check (✅ Passed): The PR title comprehensively describes the main changes: PTQ support for Qwen3-VL (VLM quantization), TE spec migration, and Nemotron-3-Nano support, which align with the file summaries showing new quantization workflows, TE-spec support, and VLM examples.
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)

177-194: Add type hints and explicitly mark the new forward params as unused.

Ruff flags both parameters as unused (ARG002), and they lack type hints. If they are placeholders for API compatibility, either plumb them through or explicitly mark them unused. As per coding guidelines, please add explicit type hints for new parameters.

🛠️ Suggested fix
-        inference_context=None,
-        runtime_gather_output=None,
+        inference_context: object | None = None,
+        runtime_gather_output: bool | None = None,
     ) -> torch.Tensor:
         """Forward function of the Qwen3VL model.
@@
         Returns:
             output (torch.Tensor): Loss of shape [b, s] if labels are provided, otherwise logits of shape
                 [b, s, vocab_size].
         """
+        del inference_context, runtime_gather_output
         assert pixel_values_videos is None and video_grid_thw is None, "not support video now"
🤖 Fix all issues with AI agents
In `@examples/quantization/ptq_generate_vlm.py`:
- Around line 83-103: The file contains debug-only console.print blocks that
dump model_str and per-layer checks (using is_rank_0, model_str, te_spec_layers,
console.print); remove these debug print sections before merging or gate them
behind a CLI flag (e.g., --verbose/--debug) so the prints only run when enabled;
update any argument parsing to add the flag and wrap the existing debug blocks
with a conditional on that flag (or delete the blocks entirely) to prevent
unsolicited debug output in normal runs.
- Line 267: The CLI flag is ineffective because
parser.add_argument("--trust-remote-code", action="store_true", default=True)
always yields True; change it to a negated flag so users can disable the
default: replace that call with parser.add_argument("--no-trust-remote-code",
action="store_false", dest="trust_remote_code", default=True, help="disable
trusting remote code") and ensure any code using the trust_remote_code variable
(e.g., where the main call or model loader consumes trust_remote_code) continues
to reference the same name.

In `@examples/quantization/quantize_utils.py`:
- Around line 43-83: The function get_modelopt_torch_quantization_config mutates
the mtq_config taken from QUANT_CFG_CHOICES causing global side effects across
calls; fix this by making a deep copy of QUANT_CFG_CHOICES[export_quant_cfg]
(e.g., mtq_config = deepcopy(QUANT_CFG_CHOICES[export_quant_cfg])) before any
modifications so changes are local, and add an explicit return type hint (e.g.,
-> Dict[str, Any]) to the function signature; ensure deepcopy is imported and
update any type imports as needed.
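A sketch of the copy-before-mutate pattern; the table contents here are a hypothetical stand-in, not the real `quantize_utils.py` values:

```python
from copy import deepcopy
from typing import Any

# Hypothetical stand-in for the shared config table in quantize_utils.py.
QUANT_CFG_CHOICES: dict[str, dict[str, Any]] = {
    "fp8": {"quant_cfg": {"*weight_quantizer": {"num_bits": (4, 3)}}},
}

def get_modelopt_torch_quantization_config(export_quant_cfg: str) -> dict[str, Any]:
    # Deep-copy so per-call tweaks never leak back into the global table.
    mtq_config = deepcopy(QUANT_CFG_CHOICES[export_quant_cfg])
    mtq_config["quant_cfg"]["*weight_quantizer"]["enable"] = True
    return mtq_config

cfg = get_modelopt_torch_quantization_config("fp8")
# The global table is untouched by the per-call mutation above.
assert "enable" not in QUANT_CFG_CHOICES["fp8"]["quant_cfg"]["*weight_quantizer"]
```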

In `@examples/quantization/quantize_vlm.py`:
- Around line 174-215: Move the restoration of per-module TopKRouter.topk to
module.config.moe_router_topk out of the calibration loop so forcing all-expert
routing covers the entire dataloader; specifically, keep the initial loop that
sets module.topk = module.num_experts (iterating model.named_modules() and
checking isinstance(module, TopKRouter)) before the dataloader loop and place
the restoration loop (setting module.topk = module.config.moe_router_topk)
immediately after the for messages in tqdm(...) loop completes (not inside it).
Also address the B007 static analysis hint by renaming the unused loop variable
name to _ in both places where you iterate model.named_modules() to avoid
unused-variable warnings.
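The intended control flow can be sketched with a stand-in router class; `TopKRouter` below is a dummy, not the Megatron implementation:

```python
# Dummy stand-in sketch showing the structure the review asks for: force
# all-expert routing for the WHOLE calibration loop, and restore topk only
# after the dataloader is exhausted.
class _Cfg:
    moe_router_topk = 2

class TopKRouter:
    def __init__(self):
        self.config = _Cfg()
        self.topk = self.config.moe_router_topk
        self.num_experts = 8

routers = [TopKRouter(), TopKRouter()]

for router in routers:                  # before calibration: route to all experts
    router.topk = router.num_experts

for _batch in range(4):                 # calibration dataloader loop
    assert all(r.topk == r.num_experts for r in routers)

for router in routers:                  # after the loop completes, not inside it
    router.topk = router.config.moe_router_topk
```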
- Around line 392-420: The save-path logic is duplicated: megatron_save_path is
already defaulted when None near the top, so remove the second conditional (the
if megatron_save_path / else block) and simply use save_path =
megatron_save_path before calling bridge.save_megatron_model; if you want a
console notice when a default was used, print it at the first assignment where
you set megatron_save_path (reference symbols: megatron_save_path, model_name,
save_path, bridge.save_megatron_model).

In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py`:
- Around line 112-116: The code replaces multi-dimensional attention_mask with
all-ones, discarding padding; instead collapse extra dimensions into a 2D
[batch, seq] mask by reducing (logical OR) across the extra dims so padding
zeros are preserved: compute a 2D mask from attention_mask (e.g., reduce with
torch.any over the non-batch/non-sequence axes) and then ensure its shape
matches total_input_ids.size(1) before computing position_ids; update the branch
handling attention_mask.dim() > 2 to produce this reduced mask rather than
torch.ones_like(total_input_ids), referencing attention_mask, total_input_ids,
and position_ids in the change.
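The reduction described above can be sketched as a standalone helper; `collapse_attention_mask` is a hypothetical name for illustration, not the actual `utils.py` code:

```python
import torch

def collapse_attention_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Collapse a >2-D attention mask to [batch, seq], preserving padding."""
    if attention_mask.dim() > 2:
        reduced = attention_mask.bool()
        # Logical-OR away the extra dims (e.g. [b, 1, q, k] -> [b, k]): a key
        # position stays 0 only if no query may attend to it, so padding
        # zeros survive, unlike a torch.ones_like replacement.
        while reduced.dim() > 2:
            reduced = reduced.any(dim=1)
        return reduced.to(attention_mask.dtype)
    return attention_mask

mask4d = torch.zeros(1, 1, 4, 4)
mask4d[..., :3] = 1  # last key position is padding
assert collapse_attention_mask(mask4d).tolist() == [[1.0, 1.0, 1.0, 0.0]]
```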

In `@src/megatron/bridge/training/model_load_save.py`:
- Around line 270-273: The debug block in build_and_load_model uses a broad
except Exception when calling _os.listdir(checkpoint_path); change this to catch
a more specific exception (e.g., FileNotFoundError or OSError) or remove the
debug code entirely; update the except clause to catch FileNotFoundError (or
OSError) and log the error message, referencing build_and_load_model,
checkpoint_path, and the _os.listdir call so the fix is applied to the correct
snippet.
- Around line 258-306: Remove the temporary DEBUG print statements introduced in
build_and_load_model and replace them with concise logger.debug calls or delete
them entirely; keep only the functional changes that set
model_cfg.restore_modelopt_state and model_cfg.modelopt_use_te when
has_modelopt_state(checkpoint_path) is true. Locate the block that calls
has_modelopt_state(checkpoint_path) and uses
_supports_modelopt_te_spec(hf_model_id) (references: has_modelopt_state,
_supports_modelopt_te_spec, model_cfg.restore_modelopt_state,
model_cfg.modelopt_use_te) and remove all plain print(...) debug lines,
optionally converting important diagnostic lines to logger.debug(...) with
minimal, non-verbose messages.

In `@tests/functional_tests/L2_Launch_models_qwen_vl_quantization.sh`:
- Around line 1-15: Move the shebang (#!/bin/bash) to the very first line of the
script so the shell can recognize it; update the file so that the current
copyright and license header follow the shebang instead of preceding it (i.e.,
place the existing header and comments after the shebang line), ensuring the
shebang remains exactly as shown and unchanged.
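The fix is mechanical; a minimal sketch of the corrected layout (the header comment is a placeholder, not the repository's actual license text):

```shell
#!/bin/bash
# Copyright header and license text go here, AFTER the shebang: the kernel
# only honors '#!' when it occupies the first two bytes of the file.
set -euo pipefail

# Demonstrate that a shebang-first script runs under its named interpreter.
tmp=$(mktemp)
printf '#!/bin/bash\n# license header here\necho ok\n' > "$tmp"
chmod +x "$tmp"
"$tmp"
rm -f "$tmp"
```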

In
`@tests/functional_tests/quantization/models/qwen_vl/test_qwen3_vl_quantization_workflow.py`:
- Around line 326-329: Replace the test failures that use "assert False,
f'...{quantize_result.returncode}'" with pytest.fail(...) so failures still
occur under Python -O; specifically change the block checking
quantize_result.returncode (and the analogous checks at the other locations
referenced) to call pytest.fail with a clear message that includes
quantize_result.returncode and optionally stdout/stderr, and ensure pytest is
imported at the top of the test module so pytest.fail is available; update the
occurrences tied to the quantize_result checks at the locations indicated
(including the other lines mentioned) to use pytest.fail instead of assert
False.
🧹 Nitpick comments (6)
src/megatron/bridge/training/model_load_save.py (1)

259-259: Redundant import: os is already imported at module level.

Line 17 already imports os. Using import os as _os inside the function is unnecessary and confusing.

examples/quantization/ptq_generate_vlm.py (1)

270-284: Wrap main() call with try/finally for process group cleanup.

If main() raises an exception, torch.distributed.destroy_process_group() will not be called, potentially leaving dangling processes.

Suggested fix
     args = parser.parse_args()
-    main(
-        args.hf_model_id,
-        args.tp,
-        args.pp,
-        args.ep,
-        args.etp,
-        args.megatron_load_path,
-        args.prompts,
-        args.osl,
-        args.image_path,
-        args.trust_remote_code,
-    )
-
-    if torch.distributed.is_initialized():
-        torch.distributed.destroy_process_group()
+    try:
+        main(
+            args.hf_model_id,
+            args.tp,
+            args.pp,
+            args.ep,
+            args.etp,
+            args.megatron_load_path,
+            args.prompts,
+            args.osl,
+            args.image_path,
+            args.trust_remote_code,
+        )
+    finally:
+        if torch.distributed.is_initialized():
+            torch.distributed.destroy_process_group()
examples/quantization/quantize_vlm.py (4)

40-40: Use T | None instead of Optional[T] per coding guidelines.

Suggested fix
-from typing import Generator, Optional
+from typing import Generator

Then update the type hints in function signatures:

-    megatron_save_path: Optional[str] = None,
+    megatron_save_path: str | None = None,
...
-    test_image_path: Optional[str] = None,
+    test_image_path: str | None = None,

119-153: Consider adding a seed parameter for reproducibility.

The random calibration data is non-reproducible across runs. For CI/CD debugging and reproducibility, consider adding an optional seed parameter.

Suggested improvement
 def get_random_calib_dataloader(
     calib_size: int = 512,
-    image_size: tuple = (224, 224),
+    image_size: tuple[int, int] = (224, 224),
+    seed: int | None = None,
 ) -> Generator[dict, None, None]:
     ...
     import numpy as np
     from PIL import Image
 
+    if seed is not None:
+        np.random.seed(seed)
+
     for i in range(calib_size):

218-226: Consider adding type hints for model and processor parameters.

The function parameters lack type hints. While acceptable for an example script, adding hints improves IDE support and documentation.

Example
 def _custom_prompt_forward_loop_func(
-    model,
-    processor,
+    model: torch.nn.Module,
+    processor: AutoProcessor,
     is_rank_0: bool,
     prompts: str,
     osl: int = 32,
     test_image_path: str = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
 ):

462-483: Wrap main() call with try/finally for process group cleanup.

Same issue as ptq_generate_vlm.py - if main() raises, the process group won't be destroyed.

Suggested fix
     args = parser.parse_args()
-    main(
-        args.hf_model_id,
-        ...
-        args.use_random_calib,
-    )
-
-    if torch.distributed.is_initialized():
-        torch.distributed.destroy_process_group()
+    try:
+        main(
+            args.hf_model_id,
+            ...
+            args.use_random_calib,
+        )
+    finally:
+        if torch.distributed.is_initialized():
+            torch.distributed.destroy_process_group()

ChenhanYu previously approved these changes Jan 27, 2026

@ChenhanYu ChenhanYu left a comment


LGTM;

model_cfg.restore_modelopt_state = True
# Check if the model supports TE spec for modelopt (e.g., Qwen3-8B)
# If so, set modelopt_use_te=True to use TE spec instead of local spec
hf_model_id = getattr(model_cfg, "hf_model_id", None)
Contributor


Do you need to use this? We designed it only for the deployment repo and don't want other parts to rely on this ID, because it may contain an HF local file path.

Contributor Author


I see. I am trying to grab the HF ID for the specific model and call this function to decide whether this model supports running PTQ with the TE spec. Once all models support PTQ with the TE spec, this attribute, modelopt_use_te, will be deprecated. Could you suggest another attribute from which I can get the HF ID?

ko3n1g previously approved these changes Jan 28, 2026
@yueshen2016
Contributor Author

/ok to test 4dfc4c7

ko3n1g previously approved these changes Feb 25, 2026
yaoyu-33 previously approved these changes Feb 25, 2026
@yueshen2016
Contributor Author

/ok to test 4cc9adb

@yueshen2016
Contributor Author

/ok to test adeb776

Signed-off-by: James Shen <yueshen@nvidia.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
@yueshen2016
Contributor Author

/ok to test 51d298f
