[OMNIML2914] Support Nemotron-3-Nano PTQ, TE spec migration, and VLM quantization (Qwen3-VL)#1742
Conversation
Force-pushed from f7453ef to 3526cf1
📝 Walkthrough

This PR introduces quantization support for Vision-Language Models (VLMs), particularly Qwen3-VL, by adding new quantization utilities, VLM-specific quantization and generation scripts, updating the Qwen3-VL model bridge with additional parameter mappings, and adding comprehensive functional test coverage for the new workflows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant Main as Main Entry<br/>(quantize_vlm)
    participant Bridge as AutoBridge
    participant Processor as AutoProcessor
    participant MegatronMgr as Megatron<br/>ModelProvider
    participant Quantizer as ModelOpt<br/>Quantizer
    participant SaveMgr as Checkpoint<br/>Manager
    User->>Main: Call with HF model ID,<br/>parallelism config
    Main->>Bridge: Load HF VLM
    Bridge-->>Main: Wrapped model instance
    Main->>Processor: Load text/image processor
    Processor-->>Main: Processor instance
    Main->>MegatronMgr: Configure TP/PP/EP/ETP
    Main->>MegatronMgr: Initialize Megatron model
    MegatronMgr-->>Main: Initialized model
    Main->>Main: Select calibration data<br/>(COCO or random)
    Main->>Quantizer: Run quantization with<br/>forward loop
    Quantizer->>Quantizer: Apply PTQ passes
    Quantizer-->>Main: Quantized model
    Main->>SaveMgr: Optionally compress weights
    Main->>SaveMgr: Save quantized checkpoint
    SaveMgr-->>Main: Save complete
    Main->>Main: Run test prompt/image<br/>forward pass
    Main-->>User: Generation output & stats
```
```mermaid
sequenceDiagram
    participant User as User/CLI
    participant Main as Main Entry<br/>(ptq_generate_vlm)
    participant Bridge as AutoBridge
    participant Processor as AutoProcessor
    participant MegatronMgr as Megatron<br/>ModelProvider
    participant ChkptLoader as Checkpoint<br/>Loader
    participant Validator as Quantization<br/>Validator
    participant Generator as Generation<br/>Loop
    User->>Main: Call with quantized<br/>checkpoint path
    Main->>Main: Validate paths &<br/>environment
    Main->>Bridge: Load HF model
    Bridge-->>Main: Model instance
    Main->>Processor: Load processor
    Processor-->>Main: Processor instance
    Main->>MegatronMgr: Configure parallelism
    Main->>ChkptLoader: Load quantized checkpoint
    ChkptLoader-->>Main: Loaded state dict
    Main->>Main: Apply to Megatron model
    Main->>Validator: Validate quantized layers<br/>(TE-spec layers present)
    Validator-->>Main: Validation passed/failed
    alt Validation Success
        Main->>Generator: Run generation loop<br/>with image & prompts
        Generator-->>Main: Generation outputs
        Main-->>User: Output messages & results
    else Validation Failed
        Main-->>User: Error: Missing quantized layers
    end
    Main->>Main: Cleanup distributed<br/>process group
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py (1)
177-194: Add type hints and explicitly mark the new forward params as unused.

Ruff flags both parameters as unused (ARG002), and they lack type hints. If they are placeholders for API compatibility, either plumb them through or explicitly mark them unused. As per coding guidelines, please add explicit type hints for new parameters.
🛠️ Suggested fix

```diff
-        inference_context=None,
-        runtime_gather_output=None,
+        inference_context: object | None = None,
+        runtime_gather_output: bool | None = None,
     ) -> torch.Tensor:
         """Forward function of the Qwen3VL model.
@@
         Returns:
             output (torch.Tensor): Loss of shape [b, s] if labels are provided,
                 otherwise logits of shape [b, s, vocab_size].
         """
+        del inference_context, runtime_gather_output
         assert pixel_values_videos is None and video_grid_thw is None, "not support video now"
```
🤖 Fix all issues with AI agents
In `@examples/quantization/ptq_generate_vlm.py`:
- Around line 83-103: The file contains debug-only console.print blocks that
dump model_str and per-layer checks (using is_rank_0, model_str, te_spec_layers,
console.print); remove these debug print sections before merging or gate them
behind a CLI flag (e.g., --verbose/--debug) so the prints only run when enabled;
update any argument parsing to add the flag and wrap the existing debug blocks
with a conditional on that flag (or delete the blocks entirely) to prevent
unsolicited debug output in normal runs.
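The gating pattern the comment asks for can be sketched as follows. This is a minimal stand-in, not the script's actual code: `build_parser`, `report_layers`, and the layer name are hypothetical, and the real `console.print` blocks would simply sit inside the `if debug:` branch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Debug output is opt-in: nothing prints unless --debug is passed.
    parser.add_argument("--debug", action="store_true", default=False)
    return parser


def report_layers(debug: bool, layers: list[str]) -> list[str]:
    # Hypothetical stand-in for the console.print blocks that dump model_str
    # and per-layer TE-spec checks; emits nothing when debug is off.
    lines = [f"layer: {name}" for name in layers] if debug else []
    for line in lines:
        print(line)
    return lines


quiet = report_layers(build_parser().parse_args([]).debug, ["linear_qkv"])
verbose = report_layers(build_parser().parse_args(["--debug"]).debug, ["linear_qkv"])
```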
- Line 267: The CLI flag is ineffective because
parser.add_argument("--trust-remote-code", action="store_true", default=True)
always yields True; change it to a negated flag so users can disable the
default: replace that call with parser.add_argument("--no-trust-remote-code",
action="store_false", dest="trust_remote_code", default=True, help="disable
trusting remote code") and ensure any code using the trust_remote_code variable
(e.g., where the main call or model loader consumes trust_remote_code) continues
to reference the same name.
In `@examples/quantization/quantize_utils.py`:
- Around line 43-83: The function get_modelopt_torch_quantization_config mutates
the mtq_config taken from QUANT_CFG_CHOICES causing global side effects across
calls; fix this by making a deep copy of QUANT_CFG_CHOICES[export_quant_cfg]
(e.g., mtq_config = deepcopy(QUANT_CFG_CHOICES[export_quant_cfg])) before any
modifications so changes are local, and add an explicit return type hint (e.g.,
-> Dict[str, Any]) to the function signature; ensure deepcopy is imported and
update any type imports as needed.
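The mutation hazard and its deep-copy fix look roughly like this. `get_quant_config` and the registry contents are simplified stand-ins for `get_modelopt_torch_quantization_config` and `QUANT_CFG_CHOICES`, not the real ModelOpt config schema:

```python
from copy import deepcopy
from typing import Any

# Hypothetical registry mirroring QUANT_CFG_CHOICES: shared nested dicts that
# callers must not mutate.
QUANT_CFG_CHOICES: dict[str, dict[str, Any]] = {
    "fp8": {"quant_cfg": {"*weight_quantizer": {"num_bits": (4, 3)}}},
}


def get_quant_config(choice: str, calib_size: int) -> dict[str, Any]:
    # Deep-copy before mutating so repeated calls start from a pristine
    # template instead of accumulating earlier callers' edits.
    cfg = deepcopy(QUANT_CFG_CHOICES[choice])
    cfg["calib_size"] = calib_size
    return cfg


first = get_quant_config("fp8", 32)
second = get_quant_config("fp8", 512)
```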
In `@examples/quantization/quantize_vlm.py`:
- Around line 174-215: Move the restoration of per-module TopKRouter.topk to
module.config.moe_router_topk out of the calibration loop so forcing all-expert
routing covers the entire dataloader; specifically, keep the initial loop that
sets module.topk = module.num_experts (iterating model.named_modules() and
checking isinstance(module, TopKRouter)) before the dataloader loop and place
the restoration loop (setting module.topk = module.config.moe_router_topk)
immediately after the for messages in tqdm(...) loop completes (not inside it).
Also address the B007 static analysis hint by renaming the unused loop variable
name to _ in both places where you iterate model.named_modules() to avoid
unused-variable warnings.
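The set-before / restore-after shape described above can be sketched with toy stand-ins (this `TopKRouter` and the flat router list are illustrative; the real code walks `model.named_modules()` and reads `module.config.moe_router_topk`):

```python
# Toy stand-in for Megatron's TopKRouter, holding just the fields we need.
class TopKRouter:
    def __init__(self, num_experts: int, moe_router_topk: int) -> None:
        self.num_experts = num_experts
        self.topk = moe_router_topk
        self.moe_router_topk = moe_router_topk  # stands in for module.config


routers = [TopKRouter(num_experts=8, moe_router_topk=2) for _ in range(2)]
calib_batches = ["batch0", "batch1", "batch2"]

# Force all-expert routing once, before the calibration loop.
for router in routers:
    router.topk = router.num_experts

for batch in calib_batches:
    pass  # the calibration forward pass would run here

# Restore only after the entire dataloader is consumed, so every batch
# was calibrated with all experts active.
for router in routers:
    router.topk = router.moe_router_topk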
- Around line 392-420: The save-path logic is duplicated: megatron_save_path is
already defaulted when None near the top, so remove the second conditional (the
if megatron_save_path / else block) and simply use save_path =
megatron_save_path before calling bridge.save_megatron_model; if you want a
console notice when a default was used, print it at the first assignment where
you set megatron_save_path (reference symbols: megatron_save_path, model_name,
save_path, bridge.save_megatron_model).
In `@src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py`:
- Around line 112-116: The code replaces multi-dimensional attention_mask with
all-ones, discarding padding; instead collapse extra dimensions into a 2D
[batch, seq] mask by reducing (logical OR) across the extra dims so padding
zeros are preserved: compute a 2D mask from attention_mask (e.g., reduce with
torch.any over the non-batch/non-sequence axes) and then ensure its shape
matches total_input_ids.size(1) before computing position_ids; update the branch
handling attention_mask.dim() > 2 to produce this reduced mask rather than
torch.ones_like(total_input_ids), referencing attention_mask, total_input_ids,
and position_ids in the change.
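The logical-OR reduction described above can be illustrated with NumPy for brevity (the real code would use `torch.any`; the expanded-mask construction here is an assumed example shape, not the bridge's actual mask layout):

```python
import numpy as np

batch, seq = 2, 4
mask_2d = np.array([[1, 1, 1, 0],
                    [1, 1, 0, 0]])
# A [batch, 1, seq, seq]-style expanded mask built from mask_2d.
expanded = mask_2d[:, None, None, :] * mask_2d[:, None, :, None]

# Collapse every axis except batch (first) and sequence (last) with a
# logical OR, so padding zeros survive instead of being replaced by ones.
reduced = expanded.any(axis=tuple(range(1, expanded.ndim - 1))).astype(int)
```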
In `@src/megatron/bridge/training/model_load_save.py`:
- Around line 270-273: The debug block in build_and_load_model uses a broad
except Exception when calling _os.listdir(checkpoint_path); change this to catch
a more specific exception (e.g., FileNotFoundError or OSError) or remove the
debug code entirely; update the except clause to catch FileNotFoundError (or
OSError) and log the error message, referencing build_and_load_model,
checkpoint_path, and the _os.listdir call so the fix is applied to the correct
snippet.
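A narrowed version of that except clause might look like this (the helper name is hypothetical; only the `OSError` handling mirrors the suggested fix):

```python
import os


def list_checkpoint_files(checkpoint_path: str) -> list[str]:
    # Catch only the errors os.listdir can actually raise, rather than a
    # blanket `except Exception` that would also swallow unrelated bugs.
    try:
        return sorted(os.listdir(checkpoint_path))
    except OSError as e:  # covers FileNotFoundError, PermissionError, ...
        print(f"could not list {checkpoint_path}: {e.__class__.__name__}")
        return []


missing = list_checkpoint_files("/definitely/not/a/real/checkpoint/dir")
```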
- Around line 258-306: Remove the temporary DEBUG print statements introduced in
build_and_load_model and replace them with concise logger.debug calls or delete
them entirely; keep only the functional changes that set
model_cfg.restore_modelopt_state and model_cfg.modelopt_use_te when
has_modelopt_state(checkpoint_path) is true. Locate the block that calls
has_modelopt_state(checkpoint_path) and uses
_supports_modelopt_te_spec(hf_model_id) (references: has_modelopt_state,
_supports_modelopt_te_spec, model_cfg.restore_modelopt_state,
model_cfg.modelopt_use_te) and remove all plain print(...) debug lines,
optionally converting important diagnostic lines to logger.debug(...) with
minimal, non-verbose messages.
In `@tests/functional_tests/L2_Launch_models_qwen_vl_quantization.sh`:
- Around line 1-15: Move the shebang (#!/bin/bash) to the very first line of the
script so the shell can recognize it; update the file so that the current
copyright and license header follow the shebang instead of preceding it (i.e.,
place the existing header and comments after the shebang line), ensuring the
shebang remains exactly as shown and unchanged.
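The corrected header order is simply shebang first, license comment second; a minimal sketch (the copyright wording here is abbreviated and hypothetical, the real file keeps its existing NVIDIA header text):

```shell
#!/bin/bash
# The shebang above must be the file's very first bytes for the kernel to
# honor it; the license header moves below it unchanged.
# Copyright (c) NVIDIA CORPORATION. Licensed under the Apache License 2.0.
set -euo pipefail
msg="shebang first: ok"
echo "$msg"
```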
In
`@tests/functional_tests/quantization/models/qwen_vl/test_qwen3_vl_quantization_workflow.py`:
- Around line 326-329: Replace the test failures that use "assert False,
f'...{quantize_result.returncode}'" with pytest.fail(...) so failures still
occur under Python -O; specifically change the block checking
quantize_result.returncode (and the analogous checks at the other locations
referenced) to call pytest.fail with a clear message that includes
quantize_result.returncode and optionally stdout/stderr, and ensure pytest is
imported at the top of the test module so pytest.fail is available; update the
occurrences tied to the quantize_result checks at the locations indicated
(including the other lines mentioned) to use pytest.fail instead of assert
False.
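The `-O` behavior that motivates this comment can be demonstrated directly; in the test itself the replacement would be along the lines of `pytest.fail(f"quantization failed: {quantize_result.returncode}")`. A small probe (message text is illustrative):

```python
import subprocess
import sys

# Under `python -O`, assert statements are compiled away, so a check ending
# in `assert False, msg` silently passes; pytest.fail(msg) raises regardless.
probe = "assert False, 'quantize returncode check'"
opt_run = subprocess.run([sys.executable, "-O", "-c", probe])
plain_run = subprocess.run([sys.executable, "-c", probe])
print(opt_run.returncode, plain_run.returncode)
```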
🧹 Nitpick comments (6)
src/megatron/bridge/training/model_load_save.py (1)

259-259: Redundant import: `os` is already imported at module level.

Line 17 already imports `os`. Using `import os as _os` inside the function is unnecessary and confusing.

examples/quantization/ptq_generate_vlm.py (1)
270-284: Wrap the `main()` call with try/finally for process group cleanup.

If `main()` raises an exception, `torch.distributed.destroy_process_group()` will not be called, potentially leaving dangling processes.

Suggested fix

```diff
 args = parser.parse_args()
-main(
-    args.hf_model_id,
-    args.tp,
-    args.pp,
-    args.ep,
-    args.etp,
-    args.megatron_load_path,
-    args.prompts,
-    args.osl,
-    args.image_path,
-    args.trust_remote_code,
-)
-
-if torch.distributed.is_initialized():
-    torch.distributed.destroy_process_group()
+try:
+    main(
+        args.hf_model_id,
+        args.tp,
+        args.pp,
+        args.ep,
+        args.etp,
+        args.megatron_load_path,
+        args.prompts,
+        args.osl,
+        args.image_path,
+        args.trust_remote_code,
+    )
+finally:
+    if torch.distributed.is_initialized():
+        torch.distributed.destroy_process_group()
```

examples/quantization/quantize_vlm.py (4)
40-40: Use `T | None` instead of `Optional[T]` per coding guidelines.

Suggested fix

```diff
-from typing import Generator, Optional
+from typing import Generator
```

Then update the type hints in function signatures:

```diff
-    megatron_save_path: Optional[str] = None,
+    megatron_save_path: str | None = None,
 ...
-    test_image_path: Optional[str] = None,
+    test_image_path: str | None = None,
```
119-153: Consider adding a seed parameter for reproducibility.

The random calibration data is non-reproducible across runs. For CI/CD debugging and reproducibility, consider adding an optional seed parameter.

Suggested improvement

```diff
 def get_random_calib_dataloader(
     calib_size: int = 512,
-    image_size: tuple = (224, 224),
+    image_size: tuple[int, int] = (224, 224),
+    seed: int | None = None,
 ) -> Generator[dict, None, None]:
     ...
     import numpy as np
     from PIL import Image

+    if seed is not None:
+        np.random.seed(seed)
+
     for i in range(calib_size):
```
218-226: Consider adding type hints for the `model` and `processor` parameters.

The function parameters lack type hints. While acceptable for an example script, adding hints improves IDE support and documentation.

Example

```diff
 def _custom_prompt_forward_loop_func(
-    model,
-    processor,
+    model: torch.nn.Module,
+    processor: AutoProcessor,
     is_rank_0: bool,
     prompts: str,
     osl: int = 32,
     test_image_path: str = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
 ):
```
462-483: Wrap the `main()` call with try/finally for process group cleanup.

Same issue as `ptq_generate_vlm.py`: if `main()` raises, the process group won't be destroyed.

Suggested fix

```diff
 args = parser.parse_args()
-main(
-    args.hf_model_id,
-    ...
-    args.use_random_calib,
-)
-
-if torch.distributed.is_initialized():
-    torch.distributed.destroy_process_group()
+try:
+    main(
+        args.hf_model_id,
+        ...
+        args.use_random_calib,
+    )
+finally:
+    if torch.distributed.is_initialized():
+        torch.distributed.destroy_process_group()
```
tests/functional_tests/quantization/models/qwen_vl/test_qwen3_vl_quantization_workflow.py (outdated comment, resolved)
```python
        model_cfg.restore_modelopt_state = True
        # Check if the model supports TE spec for modelopt (e.g., Qwen3-8B)
        # If so, set modelopt_use_te=True to use TE spec instead of local spec
        hf_model_id = getattr(model_cfg, "hf_model_id", None)
```
Do you need to use this? We designed it only for the deployment repo and don't want other parts to rely on this id, because it may contain an HF local file path.
I see. I am trying to get the HF id for the specific model and call this function to decide whether the model supports running PTQ with the TE spec. Once all models support PTQ with the TE spec, the `modelopt_use_te` attribute will be deprecated. Could you suggest another attribute from which I can get the HF id?
/ok to test 4dfc4c7

/ok to test 4cc9adb

/ok to test adeb776

Signed-off-by: James Shen <yueshen@nvidia.com>

/ok to test 51d298f
What does this PR do?
This PR adds support for Post-Training Quantization (PTQ) and quantized-checkpoint resume for large language models and vision-language models (VLMs) using Megatron-Bridge, shifting ModelOpt quantization from the local spec to the TE (Transformer Engine) spec.
Specifically:
Changelog
Added VLM quantization and generation example scripts (`quantize_vlm.py`, `ptq_generate_vlm.py`) with configurable parallelism
NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
PTQ
Resume Quantized Checkpoint
Qwen3-VL-30B-A3B-Instruct
PTQ
```shell
torchrun --nproc_per_node=8 examples/quantization/quantize_vlm.py \
    --hf-model-id /models/Qwen3-VL-30B-A3B-Instruct \
    --export-quant-cfg fp8 \
    --megatron-save-path /models/Qwen3-VL-30B-A3B-Instruct_fp8_mlm \
    --tp 4 \
    --etp 4 \
    --pp 2 \
    --calib-size 32
```

Generate
```shell
torchrun --nproc_per_node=8 examples/quantization/ptq_generate_vlm.py \
    --hf-model-id /models/Qwen3-VL-30B-A3B-Instruct \
    --megatron-load-path /models/Qwen3-VL-30B-A3B-Instruct_fp8_mlm \
    --tp 8 \
    --ep 8 \
    --image-path /models/demo.jpeg \
    --prompts "Describe this image."
```

Qwen3-VL-8B-Instruct
PTQ
```shell
torchrun --nproc_per_node=8 examples/quantization/quantize_vlm.py \
    --hf-model-id /models/Qwen3-VL-8B-Instruct \
    --export-quant-cfg fp8 \
    --megatron-save-path /models/Qwen3-VL-8B-Instruct_fp8_mlm \
    --tp 4 \
    --pp 2 \
    --calib-size 8
```

GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information