Support Mixed precision & Static MSE in MCore; Nemotron Super v3 NVFP4 recipe#1521
📝 Walkthrough
This PR adds NVFP4 static-quantizer validation and restoration, per-layer quantization metadata recording in Megatron HF exports, Hugging Face Hub offline-mode support for sidecar copying, expert-parallel distributed synchronization in auto-quantize recipe selection, Nemotron PTQ recipes, and comprehensive test coverage across all new functionality.
Changes
NVFP4 validation, Megatron mixed-precision export, and MoE expert-parallel support
🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 5 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
A continuation of #1363
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/export/unified_export_megatron.py (1)
818-828: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Treat `QUANTIZATION_NONE` as unquantized when building `exclude_modules`.
This branch only records excludes for `qformat is None`, but the same method immediately returns early on `qformat == QUANTIZATION_NONE`, and `_qkv_slicing()` already treats both values the same. As written, any normal module reported as `QUANTIZATION_NONE` will skip the HF ignore list even though it is still unquantized.
Suggested fix
- if qformat is None and "norm" not in prefix:
+ if qformat in (None, QUANTIZATION_NONE) and "norm" not in prefix:
      self._record_excluded_module(prefix)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/export/unified_export_megatron.py` around lines 818 - 828, The code currently only calls _record_excluded_module(prefix) when qformat is None, but QUANTIZATION_NONE should be treated the same; update the branch in unified_export_megatron.py (the block around qformat, QUANTIZATION_NONE, _get_weight_bias, and _record_excluded_module) so that if qformat is None or qformat == QUANTIZATION_NONE (and "norm" not in prefix) you record the module as excluded before the early return; keep the existing early return for QUANTIZATION_NONE but ensure the exclude is recorded first and keep compatibility with _qkv_slicing behavior.
modelopt/torch/quantization/algorithms.py (1)
765-782: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Recompute the persisted score/cost after recipe synchronization.
After `best_format` is replaced by the DP/TP/EP-synchronized value, `best_constraints` and `best_scores` are still accumulated from the local solver choice. On ranks that did not originate the synchronized format, `self.best["constraints"]` / `self.best["score"]` can end up describing a different recipe than the one actually activated and checkpointed.
Suggested fix
  for name, best_hparam_recipe_info in best_recipe_info.items():
      # Solvers could give different solutions for the same layer across DP/TP/EP groups even though
      # the scores and costs are the same. Lets make sure the same recipe is selected across DP/TP/EP
      _ps = self.model.get_submodule(name.split(".quant_recipe")[0]).parallel_state
      best_format = DistributedProcessGroup.get_dist_syncd_obj(
          best_hparam_recipe_info["format"],
          [
              _ps.data_parallel_group,
              _ps.tensor_parallel_group,
              _ps.expert_model_parallel_group,
          ],
          lambda a: a[0],
      )
      best_recipe[name] = best_format
-     get_hparam(self.model, name).active = best_format
-     best_constraints += best_hparam_recipe_info["costs"]
-     best_scores += best_hparam_recipe_info["scores"]
+     hparam = get_hparam(self.model, name)
+     hparam.active = best_format
+     best_constraints += hparam.get_cost(best_format)
+     best_scores += hparam.get_score(best_format)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/algorithms.py` around lines 765 - 782, The loop currently accumulates best_constraints and best_scores from best_hparam_recipe_info before replacing the local solver's format with the DP/TP/EP-synchronized best_format; update the code so that after you set best_recipe[name] = best_format and get_hparam(self.model, name).active = best_format you recompute and add the costs and scores that correspond to the actually activated best_format (not the original best_hparam_recipe_info["format"]); locate the mapping of format->costs/scores that the solver produced for the layer (referencing best_recipe_info, best_hparam_recipe_info and get_hparam) and use that entry to increment best_constraints and best_scores (and keep self.best["constraints"]/self.best["score"] consistent with the activated recipe).
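The inconsistency described in this comment can be illustrated with a toy example. This is only a sketch: the `candidates` table, the layer names, and the `totals` helper are hypothetical and do not correspond to ModelOpt internals. It shows that totals accumulated from a rank's local solver choice can differ from totals computed for the synchronized recipe that is actually activated.

```python
# Hypothetical per-layer candidates: format -> cost and score (not ModelOpt data structures).
candidates = {
    "layer0": {"fp8": {"cost": 8.0, "score": 0.25}, "nvfp4": {"cost": 4.0, "score": 0.5}},
    "layer1": {"fp8": {"cost": 8.0, "score": 0.125}, "nvfp4": {"cost": 4.0, "score": 0.75}},
}


def totals(active_formats):
    """Sum cost and score for the format that is active on each layer."""
    total_cost = sum(candidates[name][fmt]["cost"] for name, fmt in active_formats.items())
    total_score = sum(candidates[name][fmt]["score"] for name, fmt in active_formats.items())
    return total_cost, total_score


# What this rank's solver picked locally (assumed for illustration).
local_choice = {"layer0": "fp8", "layer1": "nvfp4"}
# What every rank activates after synchronization (assumed: another rank's pick wins).
synced_choice = {"layer0": "nvfp4", "layer1": "nvfp4"}

stale = totals(local_choice)        # describes a recipe that is NOT the active one
consistent = totals(synced_choice)  # describes the recipe that is actually active

print("stale:", stale)
print("consistent:", consistent)
```

Computing the totals from the synchronized choice keeps the persisted numbers aligned with the activated recipe on every rank, which is the property the suggested fix aims for.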
🧹 Nitpick comments (1)
tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py (1)
32-42: ⚡ Quick win
Add one regression that uses only the restored `_global_amax` path.
The implementation change specifically supports static quantizers restored with `_global_amax`, but this helper only seeds `global_amax`, so the new restore path is still untested. A single round-trip case that sets `_global_amax` directly would keep the actual bugfix from regressing.
As per coding guidelines, `tests/**/*.py`: Write focused unit tests during development and curate production tests to be lean, documenting expected behavior, protecting against regressions, and flagging backward-incompatible changes.
Also applies to: 45-70
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py` around lines 32 - 42, Add a focused unit test that exercises the restored _global_amax code path: create an NVFP4StaticQuantizer via the existing helper _make_static_quantizer (or directly instantiate NVFP4StaticQuantizer), set the private attribute _global_amax (not global_amax) to a tensor value, perform the export/import (or the same round‑trip flow used elsewhere in this test file) and assert the quantizer restores using the _global_amax path (e.g., resulting amax/global_amax behavior matches expected values). Ensure the test is small, documents the expected behavior, and only validates the single round‑trip regression scenario so the `_global_amax` restore remains covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 47-79: The routed-expert weight quantizers in this max-calib
recipe (entries with quantizer_name: '*mixer.experts.*weight_quantizer' and
'*mlp.experts*weight_quantizer') are set to type: dynamic but must be static for
a fair max-vs-MSE comparison; update those two quantizer blocks to use type:
static (leave the corresponding input_quantizer blocks as-is) so only the weight
quantizers for routed experts switch from dynamic to static.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 30-32: The calibration comment is misleading about FP8 scale
selection: update the comment near the calibration block that mentions "FP8
per-tensor scales" and "NVFP4 weights" (the lines describing MSE searches) to
explicitly state that only NVFP4 weight block scales are selected via MSE while
non-NVFP4 FP8 formats skip MSE and use the stack's default scaling method; edit
the text to clarify that FP8 per-tensor scales for non-NVFP4 are not
MSE-searched to avoid confusion for recipe users.
In `@modelopt/torch/quantization/plugins/custom.py`:
- Around line 148-153: The current check treats incomplete tail blocks as
invalid; instead compute blocks per row as ceil(weight.shape[-1] / block_size)
and total expected_blocks = (weight.numel() // weight.shape[-1]) *
blocks_per_row so padded trailing blocks count toward the expected amax length.
In the validation around quantizer.block_sizes / block_size, replace
expected_blocks = weight.numel() // block_size with rows = weight.numel() //
weight.shape[-1]; blocks_per_row = math.ceil(weight.shape[-1] / block_size) (or
integer ceil via (N + block_size - 1)//block_size); expected_blocks = rows *
blocks_per_row, then return amax.numel() == expected_blocks and
global_amax.numel() == 1, allowing restored `_amax` that includes padded tail
blocks.
In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 88-99: The TP>1 guard is too broad because it triggers for any
fake static-block quantizer; change the check that builds offending to only
consider NVFP4 static-block quantizers by requiring both
leaf.is_static_block_quant and that the leaf reports the NVFP4 format (e.g.,
leaf.format == "NVFP4" or the project’s NVFP4 enum/attribute — replace with the
actual attribute used in your quantizer objects) when iterating over leaves (the
variables/functions involved: weight_quantizer, SequentialQuantizer, leaves,
is_static_block_quant, offending, tp_group.world_size()); keep the rest of the
logic and the NotImplementedError unchanged.
In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py`:
- Around line 45-65: The test is comparing config.json's quantization_config to
the raw HF wrapper (hf_quant_config_dict) instead of the converted serving
format; change the test to use the converted structure (call
convert_hf_quant_config_format on hf_quant_config_dict or otherwise use the same
transformation used when producing config_dict) before asserting and before
indexing fields like "quant_algo", "ignore", and "config_groups"; update
references in the verification block so quant_config_dict refers to the
converted result (not the original hf_quant_config_dict) and then perform the
existing assertions and kv_cache checks against that converted object.
📒 Files selected for processing (26)
- CHANGELOG.rst
- examples/specdec_bench/specdec_bench/datasets/speed.py
- modelopt/torch/export/plugins/hf_checkpoint_utils.py
- modelopt/torch/export/plugins/mcore_nemotron.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/algorithms.py
- modelopt/torch/quantization/backends/utils.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/conversion.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/plugins/custom.py
- modelopt/torch/quantization/plugins/megatron.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
- tests/_test_utils/torch/quantization/quantize_common.py
- tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py
- tests/gpu_megatron/torch/export/test_unified_export_megatron.py
- tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
- tests/unit/torch/export/test_hf_checkpoint_utils.py
- tests/unit/torch/quantization/plugins/test_fused_experts.py
- tests/unit/torch/quantization/test_autoquant.py
- tests/unit/torch/quantization/test_mse_calibrator.py
- tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py
# Calibration: weight MSE with FP8-scale sweep over the 128 e4m3 scale values
# (NVFP4 weights use static block scales selected by MSE; FP8 per-tensor scales
# are also chosen via MSE search instead of plain amax).
Update the calibration comment for FP8 layers.
The comment says FP8 per-tensor scales are selected via MSE search, but this stack skips MSE for non-NVFP4 formats. This is misleading for recipe users.
Proposed fix
-# (NVFP4 weights use static block scales selected by MSE; FP8 per-tensor scales
-# are also chosen via MSE search instead of plain amax).
+# (NVFP4 routed-expert weights use static block scales selected by MSE;
+# non-NVFP4 layers, such as FP8 per-tensor, follow the non-MSE path.)
    block_sizes = getattr(quantizer, "block_sizes", None)
    block_size = block_sizes.get(-1) if isinstance(block_sizes, dict) else None
    if block_size is None or weight.shape[-1] % block_size != 0:
        return False
    expected_blocks = weight.numel() // block_size
    return amax.numel() == expected_blocks and global_amax.numel() == 1
Handle padded trailing blocks when validating restored NVFP4 state.
Static block quantization already pads the tail block during setup, so a restored `_amax` can be complete even when `weight.shape[-1] % block_size != 0`. Returning `False` here forces `max_calibrate()` and overwrites the saved MSE-derived scales for those layers.
Suggested fix
  block_sizes = getattr(quantizer, "block_sizes", None)
  block_size = block_sizes.get(-1) if isinstance(block_sizes, dict) else None
- if block_size is None or weight.shape[-1] % block_size != 0:
+ if block_size is None:
      return False
- expected_blocks = weight.numel() // block_size
+ rows = weight.numel() // weight.shape[-1]
+ expected_blocks = rows * ((weight.shape[-1] + block_size - 1) // block_size)
  return amax.numel() == expected_blocks and global_amax.numel() == 1
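The ceil-based block count proposed in this comment can be checked with plain integer arithmetic. The following is a minimal sketch; `expected_block_count` is a hypothetical helper that works on a shape tuple rather than a tensor, and it assumes blocks are formed along the last dimension with the tail padded up to a full block.

```python
def expected_block_count(shape, block_size):
    """Per-block amax entries when the last dim is padded up to a block multiple."""
    *leading, last = shape
    rows = 1
    for dim in leading:
        rows *= dim
    blocks_per_row = (last + block_size - 1) // block_size  # integer ceil(last / block_size)
    return rows * blocks_per_row


def floor_block_count(shape, block_size):
    """The count used by the original check: total elements divided by block size."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel // block_size


# Divisible case: both counts agree.
print(expected_block_count((4, 64), 16), floor_block_count((4, 64), 16))
# Non-divisible case: each row needs 5 blocks (the last one padded), so the counts differ.
print(expected_block_count((4, 70), 16), floor_block_count((4, 70), 16))
```

When the last dimension is not a multiple of the block size, the floor-based count undercounts the padded tail blocks, which is why a restored per-block amax of the padded length would fail the original comparison.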
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
    schema = table.schema
    if schema.metadata and b"huggingface" in schema.metadata:
        new_meta = {
            k: v
just a linter change
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
    return multimodal_state_dict


def copy_non_safetensor_files_from_ckpt(src: str | os.PathLike, dst: str | os.PathLike):
[SUGGESTION] The current implementation copies everything non-safetensors from the source — including config.json, hf_quant_config.json, generation_config.json, preprocessor_config.json. The docstring acknowledges this and says "The caller is expected to overwrite the files modelopt owns" — and today save_pretrained does, immediately after.
The risk is the load-bearing convention. If a future refactor adds a guarded path in save_pretrained that skips writing one of those files under some condition (e.g., a new is_last_stage_main_rank sub-branch, or a try/except around _hf_config.save_pretrained), the stale source file silently survives — no test failure, no warning, just a quietly-wrong exported checkpoint.
Two safer alternatives:
- Filter modelopt-owned files in the helper itself with an explicit skip-list (preferred, no caller-side discipline required):
_MODELOPT_OWNED_FILES = frozenset({
    "config.json",
    "generation_config.json",
    "hf_quant_config.json",
    "preprocessor_config.json",
})
def copy_non_safetensor_files_from_ckpt(src, dst):
    ...
    for entry in os.listdir(src):
        if entry in _MODELOPT_OWNED_FILES:
            continue
        if entry.endswith(".safetensors") or entry == "model.safetensors.index.json":
            continue
        ...
- Or add a post-condition assert at the end of `save_pretrained` that the modelopt-owned files were rewritten (timestamp / contents check).
Option 1 removes the silent-failure mode entirely without changing today's behavior.
        combined_layer_config_dict.update(layer_config_dict)
        return dict(sorted(combined_layer_config_dict.items()))

    def _gather_kv_cache_dtype(self):
[SUGGESTION] Returning the first non-None kv_cache_dtype silently picks one if ranks disagree (programmer error). For a setup bug where one attention block was configured with fp8 and another with nvfp4, the writer rank would emit whichever rank 0 happened to see first into hf_quant_config.json with no warning.
Cheap defense:
def _gather_kv_cache_dtype(self):
    local = getattr(self, "kv_cache_dtype", None)
    if not torch.distributed.is_initialized():
        return local
    all_dtypes = [None] * torch.distributed.get_world_size()
    torch.distributed.all_gather_object(all_dtypes, local)
    seen = {dt for dt in all_dtypes if dt is not None}
    if len(seen) > 1:
        raise RuntimeError(f"Inconsistent kv_cache_dtype across ranks: {seen}")
    return seen.pop() if seen else None
Same applies to `_gather_layer_config_dict` if a key appears with conflicting values across ranks: the current `.update()` silently picks the last one.
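The same defensive idea can be applied to merging per-rank layer configs. Below is a minimal sketch that assumes the per-rank dictionaries have already been gathered into a list; `merge_layer_configs` and the example keys are hypothetical and are not part of the exporter. It raises on conflicting values instead of letting a later `.update()` win silently.

```python
def merge_layer_configs(per_rank_configs):
    """Merge per-rank {layer_key: value} dicts, raising if ranks disagree on a key."""
    merged = {}
    for rank, config in enumerate(per_rank_configs):
        for key, value in config.items():
            if key in merged and merged[key] != value:
                raise RuntimeError(
                    f"Conflicting config for {key!r}: {merged[key]!r} vs {value!r} (rank {rank})"
                )
            merged[key] = value
    # Sorted output mirrors the sorted dict returned by the gather helper quoted above.
    return dict(sorted(merged.items()))


# Hypothetical gathered configs: ranks own different layers and agree where they overlap.
gathered = [
    {"layers.1.mlp.quantization": "nvfp4"},
    {"layers.0.attn.quantization": "fp8", "layers.1.mlp.quantization": "nvfp4"},
]
print(merge_layer_configs(gathered))
```

Identical duplicates are accepted, so ranks that legitimately report the same layer do not trigger the error; only a genuine disagreement does.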
            self._hf_pretrained_model_name
        ):
            try:
                tokenizer = transformers.AutoTokenizer.from_pretrained(
[SUGGESTION] Behavior change worth calling out in the PR body: the previous code unconditionally tried AutoTokenizer.from_pretrained(...).save_pretrained(save_directory) and silently swallowed errors. That had a useful side effect — if the source dir had a partial tokenizer (e.g. tokenizer.json present but tokenizer_config.json missing or stale), AutoTokenizer would often synthesize a valid tokenizer_config.json on the save side.
The new code skips this entirely for local-dir sources, trusting whatever the bulk copy produced. Mostly fine for clean source dirs, but it removes the safety net for partial/stale tokenizer files. Not a blocker — just consider keeping the AutoTokenizer call as a "second pass" overwrite even in the local-dir case, since it's idempotent on a clean source and corrective on a stale one.
Review summary: two asks
1. Split the AutoQuant changes into a separate PR
The PR title and body advertise four things: MCore mixed-precision export, static MSE NVFP4 fixes, the Nemotron-3 Super NVFP4 YAML recipe, and
These are not prerequisites for the Nemotron Super recipe; the recipe is a static hand-authored YAML, not an
2. Export-path concerns: four inline comments below
Claude's review already covered:
Not in Claude's review, posting inline:
Algorithmic correctness of the per-layer config aggregation +
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@ChenhanYu I removed Autoquant & GPTQ changes
cjluo-nv left a comment
Bot review: DM the bot to share feedback.
Design review: this PR extends existing infrastructure rather than introducing new abstractions. The MCore exporter now feeds its per-layer quant metadata through the existing `process_layer_quant_config` / `layer_config_dict` pipeline already used by `model_config_export.py`, and the recipes ride the YAML composition system from #1253. So the architectural choice is fine. However, several substantive issues remain:
- Untested on the target model. PR body says "TODO test in HF and MCore PTQ on Nemotron model". For a recipe explicitly mirroring a published Nemotron-3-Super NVFP4 config, the equivalence/round-trip check on real hardware is the headline test. Please run it before merging.
- Undocumented public-API behavior change in `mse_calibrate(..., fp8_scale_sweep=True)`. Previously, an FP8 (or any non-NVFP4) quantizer with no registered FP8-sweep backend factory fell through to a default `MseCalibrator`. The PR drops `test_unregistered_backend_uses_default_mse_calibrator` and the new control flow leaves such quantizers untouched (max-calibrated amax preserved). The new mixed-precision tests rely on this, but neither the docstring of `mse_calibrate` / `fp8_scale_sweep`, the `MseCalibConfig` description, nor the PR body / changelog mention that calling `fp8_scale_sweep=True` on a non-mixed FP8 model now silently keeps max amax instead of MSE-searching it. Either restore the fallback or document the change explicitly (changelog + docstring).
- PR size. 25 files / 1468 LOC across MCore export, calibration, restore, recipes, and hf_checkpoint_utils. The four pieces are largely independent and could be split; please consider it for the next round.
- Minor: redundant `_record_layer_quant_config(prefix, qformat, block_size)` calls inside the per-expert loops in `_pack_name_remapping` and `_pack_name_remapping_gpt_oss` (same `prefix` overwritten N times); `super-nvfp4-max-calib.yaml` header is `Copyright (c) 2024` while the rest of the repo and the sibling `super-nvfp4.yaml` use `2026` (LICENSE_HEADER canonical is 2026); empty `Usage` code-fence in PR body.
            start_multiplier=start_multiplier,
            stop_multiplier=stop_multiplier,
            quant_func=partial(_mse_quant_func, quantizer=module),
        )
Bot comment.
This is a public-API behavior change that the PR doesn't document. Before this PR, calling `mse_calibrate(..., fp8_scale_sweep=True)` on an FP8 (or any non-NVFP4) quantizer would fall through to a default `MseCalibrator` (covered by the now-deleted `test_unregistered_backend_uses_default_mse_calibrator`). After this PR, with `fp8_scale_sweep=True` and no registered backend factory, a non-`is_nvfp4_static` quantizer takes neither branch, so `_calibrator` is left as the max-calibrated state and no MSE search runs.
This is what you want for the new mixed NVFP4+FP8 recipes, but it silently changes behavior for any existing caller passing `fp8_scale_sweep=True` on a homogeneous FP8 model. Please either:
- restore the default MSE fallback for non-NVFP4 quantizers under `fp8_scale_sweep=True`, or
- document the new semantics in the `mse_calibrate` docstring, the `fp8_scale_sweep` docstring on `MseCalibConfig`, and the changelog, so users aren't confused why their FP8 amax stopped moving.
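For readers unfamiliar with the distinction this comment relies on, the sketch below contrasts a max-calibrated amax with an amax chosen by a simple MSE sweep. It is a toy illustration, not ModelOpt's `MseCalibrator`: the quantizer is a plain symmetric uniform grid rather than FP8 or NVFP4, `num_steps` and `levels` are made-up parameters, and only the names `start_multiplier` and `stop_multiplier` are borrowed from the snippet above.

```python
def fake_quant(values, amax, levels=7):
    """Symmetric uniform fake-quantization with `levels` positive steps, clipped to [-amax, amax]."""
    step = amax / levels
    quantized = []
    for v in values:
        q = round(v / step)
        q = max(-levels, min(levels, q))  # clip to the representable range
        quantized.append(q * step)
    return quantized


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mse_search_amax(values, start_multiplier=0.25, stop_multiplier=1.0, num_steps=16):
    """Sweep candidate amax values (multiples of the max amax) and keep the lowest-MSE one."""
    max_amax = max(abs(v) for v in values)
    best_amax = max_amax
    best_err = mse(values, fake_quant(values, max_amax))
    for i in range(num_steps):
        multiplier = start_multiplier + (stop_multiplier - start_multiplier) * i / (num_steps - 1)
        candidate = max_amax * multiplier
        err = mse(values, fake_quant(values, candidate))
        if err < best_err:
            best_amax, best_err = candidate, err
    return max_amax, best_amax, best_err


# Hypothetical weights with one outlier, where clipping the outlier may lower the overall error.
weights = [0.1, -0.2, 0.15, 0.05, -0.12, 0.18, -0.07, 3.0]
max_amax, best_amax, best_err = mse_search_amax(weights)
print("max amax:", max_amax, "searched amax:", best_amax, "mse:", best_err)
```

With max calibration the amax is simply the largest absolute value; an MSE search may pick something smaller. A caller who previously got the searched value and now gets the max value would see exactly the "amax stopped moving" effect the comment warns about.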
                weight_scale_list.append(weight_scale)
                weight_scale_2_list.append(weight_scale_2)
                input_scale_list.append(input_scale)
                self._record_layer_quant_config(prefix, qformat, block_size)
Bot comment.
`self._record_layer_quant_config(prefix, qformat, block_size)` is called inside the `for expert in module:` loop with the same `prefix` every iteration, so it just overwrites the same two `layer_config_dict` keys N times. Move this call out of the loop (same applies to the corresponding line in `_pack_name_remapping_gpt_oss`).
@@ -0,0 +1,134 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
please restructure the recipe files:
modelopt_recipes/huggingface/nemotron_h/Nemotron-3-Super-120B-A12B/ptq/super-nvfp4.yaml
modelopt_recipes/huggingface/nemotron_h/Nemotron-3-Super-120B-A12B/ptq/super-nvfp4-max-calib.yaml

I can't move it to a huggingface folder, these are for both HF and MCore. But I can add an NVIDIA/ folder to mimic the HuggingFace model name.

it's still a huggingface model, we should still put them there.

just that in the yaml file, for an MCore config, add an mcore tag to the name to make it explicit.

hmm these recipes work for both HF and MCore since they have patterns for both ...

Got it, it's your call to name them, but putting them into huggingface is still correct. MCore is the tool we use for quantization, but the model itself is HF. We identify the recipes by the models, not by the tools we use for quantization.

ok I can move to huggingface folder .. just curious, do we expect to support models outside of HF in the future?

I do want to. But the world seems to be converging on HF. So I am not sure if we will have the chance/need to do that lol.
      scale_bits: e4m3
      num_bits: e2m1
  # Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
  - quantizer_name: '*mlp.experts*weight_quantizer'
So these configs use megatron path names right?

yes we need both HF and MCore names for quantization

If I understand correctly how things work with MCore-based quantization, then this does not seem to be the correct way in my opinion. But I am not yet 100% sure.
In my opinion, the quantization config should just configure the HF format, and our library should internally convert the HF-format config to an MCore config.
But let me read through this part of the logic and I will catch you for a discussion.
Non-blocking for this PR.

no that's not true, it does pattern matching based on the model structure, and in MCore the model has different names than the HF model
We should come up with a way to unify the HF and MCore PTQ API though, I agree.

I understand the MCore models have different module paths, but if I understand it correctly, the MCore models are converted as intermediate models from HF models, which are the original input to the quant pipeline.
So a config for the quant pipeline, or any other pipeline, should target the original HF input model rather than the intermediate model format. Internally, we should then convert the original config to a config that makes the intermediate model work.
Maybe my understanding is not correct. That's why I need to read through this part of the logic. Will catch you up once I have a full understanding.

There are two options for MCore PTQ model loading: directly from an HF model or directly from an MCore model.
Oftentimes we choose the 2nd path to avoid the time it takes to convert an HF model to an MCore model. For the 2nd path, that's why we need names in both HF and MCore convention.
I understand what you're saying though, it's not immediately obvious to users that they have two ways to pass in a model for PTQ.

Ideally we should just have recipes in HF convention and both MCore PTQ paths should work by ModelOpt storing some HF-to-MCore model name mapping .. that can be improved in the future when we unify HF and MCore PTQ APIs.

Right, that's exactly what I mean. The mapping should be done by modelopt internally; even if we load from an MCore checkpoint, we should still provide the original model's information to create the mapping.
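The need for two patterns discussed in this thread can be seen with ordinary glob matching. This is a sketch under assumptions: it uses Python's `fnmatch` to stand in for whatever wildcard matching the library performs, the two patterns are the ones quoted in the review, the MCore-style name follows the path given in the recipe comment, and the HF-style name is a hypothetical example.

```python
from fnmatch import fnmatch

# Patterns quoted in the review for routed-expert weight quantizers.
patterns = ["*mixer.experts.*weight_quantizer", "*mlp.experts*weight_quantizer"]

# Hypothetical HF-style module name (illustrative only).
hf_name = "backbone.layers.1.mixer.experts.0.up_proj.weight_quantizer"
# MCore-style name following decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
mcore_name = "decoder.layers.1.mlp.experts.local_experts.0.linear_fc1.weight_quantizer"


def matching_patterns(name):
    """Return the patterns that glob-match a fully qualified quantizer name."""
    return [pattern for pattern in patterns if fnmatch(name, pattern)]


print(matching_patterns(hf_name))
print(matching_patterns(mcore_name))
```

Each naming convention is caught by exactly one of the two patterns, so a recipe carrying only the HF-style pattern would leave the MCore-named experts unmatched unless a name mapping were applied first, which is the improvement the thread proposes for later.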
ChenhanYu left a comment
Went through the export part. Great that the Autoquant-related changes have been separated out.
| sync_expert_weight_amax: SequentialMLP only — share one weight amax across all experts | ||
| in a MoE layer (within-rank sync + EP all-reduce when EP>1). |
Would this impact accuracy? I think for HF PTQ, experts have separate amax values

HF PTQ doesn't use this EP sync

> HF PTQ doesn't use this EP sync
That's true, but I am more curious about how this will impact the accuracy. Have we run anything to measure the impact on the accuracy?

We should deprecate this argument once TE MoE supports separate quantizers per expert.

`sync_expert_weight_amax` is False by default in `max_calibrate`. It is added for testing purposes only (e.g. to compare against TEGroupedMLP, which still shares amax for all experts).
The old behavior of MCore PTQ was to always sync EP experts, which during Nano/Super PTQ experiments we realized could lead to accuracy degradation. We removed the rank-local expert sync in the `layer_sync_moe_local_experts_amax` function, but forgot to remove the cross-EP-rank sync. This PR removes the EP sync so that all experts in an MoE have different amaxes for full correctness.

if you see `_should_sync_amax_across_ep` in line 256, it skips EP sync for routed experts unless you turn on `sync_expert_weight_amax`

@realAsma agreed, I have a PR to add separate expert quantizers in TEGroupedMLP but that needs more testing ..
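The difference between per-expert and shared amax discussed above can be shown with a toy calculation. This sketch is not the MCore implementation: the expert names and weights are made up, and a shared amax is modeled simply as the maximum over all experts, which is what a max-style sync would produce.

```python
# Hypothetical routed-expert weights with very different dynamic ranges.
expert_weights = {
    "expert0": [0.5, -0.25, 0.125],
    "expert1": [4.0, -2.0, 1.0],
}


def per_expert_amax(weights_by_expert):
    """Each expert keeps the max absolute value of its own weights."""
    return {name: max(abs(w) for w in ws) for name, ws in weights_by_expert.items()}


def synced_amax(weights_by_expert):
    """All experts share one amax: the maximum over every expert's amax."""
    shared = max(per_expert_amax(weights_by_expert).values())
    return {name: shared for name in weights_by_expert}


separate = per_expert_amax(expert_weights)
shared = synced_amax(expert_weights)
print("separate:", separate)
print("shared:", shared)
# Ratio between the shared and own amax, i.e. how much coarser each expert's scale becomes.
print({name: shared[name] / separate[name] for name in expert_weights})
```

In this example the small-range expert ends up with a scale eight times coarser under sharing, which is one plausible mechanism for the accuracy degradation mentioned in the thread; the actual impact on a real model would have to be measured.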
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
8f1d879 to d63bf70
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
What does this PR do?
Type of change: New recipe + Bug Fixes
MCore and MSE fixes
NVFP4QTensor (not TensorQuantizer, which can call max calibrate; we want to skip max calibrate for static quantizers during restore) --> fixes bug during MCore export for MSE
fp8_scale_sweep=True
block_sizes is dict-backed.
hf_quant_config.json
Export bug fixes
Super recipe
Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 hf_quant_config.json:
rest: not quantized
Usage
# Add a code snippet demonstrating how to use this
Testing
Tested on Nemotron model
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
CONTRIBUTING.md: ✅ / ❌ / N/A
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation