Merged
34 commits
6e456b0
fix: PTQ 1GPU, export PP divisibility, hidden states conversations key
ChenhanYu Apr 18, 2026
48d0a37
add nvfp4-w4a16 support
hychiang-git Apr 22, 2026
f34694b
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git Apr 22, 2026
2ee0825
update CHANGELOG.rst and add test for nvfp4 w4a16
hychiang-git Apr 22, 2026
0e8815d
tiny fix on CHANGELOG.rst
hychiang-git Apr 22, 2026
dc506c6
tiny fix on CHANGELOG.rst
hychiang-git Apr 22, 2026
d8611e2
f*{mod}*_quantizer -> f*{mod}*.weight_quantizer and f*{mod}*.input_qu…
hychiang-git Apr 22, 2026
4b62138
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git Apr 23, 2026
36ce897
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git Apr 23, 2026
41f9988
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git Apr 27, 2026
c3b884a
rm --exclude_modules
hychiang-git Apr 29, 2026
75ff6b2
rm --exclude_modules
hychiang-git Apr 29, 2026
b3a40fd
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git Apr 29, 2026
81f845a
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git May 5, 2026
0fede96
nvfp4_w4a16 -> w4a16_nvfp4
hychiang-git May 5, 2026
2cc3230
fix vllm comments
hychiang-git May 6, 2026
9409858
merge main, resolve conflicts
hychiang-git May 6, 2026
ea1d0b2
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git May 6, 2026
8594574
make huggingface_example.sh and parser.sh support recipes
hychiang-git May 6, 2026
9808e01
add w4a16_nvfp4 recipes
hychiang-git May 6, 2026
e8c0602
fix: call mono_quantize() for --recipe path regardless of --qformat d…
hychiang-git May 7, 2026
08cd44f
add weight-only supports in _QuantFusedExperts
hychiang-git May 7, 2026
2c9186f
merge main
hychiang-git May 8, 2026
27bab2b
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git May 11, 2026
f3e4a13
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git May 12, 2026
f0b1113
rename w4a16_nvfp4.yaml to w4_nvfp4.yaml
hychiang-git May 12, 2026
62dd281
Merge branch 'hungyueh/modelopt-nvfp4-w4a16' of github.com:NVIDIA/Mod…
hychiang-git May 12, 2026
6cad5ea
merge main and resolve conflicts
hychiang-git May 12, 2026
ad01342
w4a16_nvfp4: configs/ptq/units/w4a16_nvfp4 -> w4a16_nvfp4: configs/pt…
hychiang-git May 13, 2026
1be38df
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git May 13, 2026
56a1e78
rm comments
hychiang-git May 14, 2026
27fbb53
rm docker guard
hychiang-git May 14, 2026
14d5e0d
refactor: simplify mono_quantize condition to if quant_cfg
hychiang-git May 14, 2026
d3e2c90
Merge branch 'main' into hungyueh/modelopt-nvfp4-w4a16
hychiang-git May 14, 2026
1 change: 1 addition & 0 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
@@ -30,6 +30,7 @@ Changelog

**New Features**

- Add NVFP4 W4A16 weight-only quantization (``w4a16_nvfp4``): FP4 weights with group_size=16, BF16 activations, no calibration forward pass required. Use ``mtq.W4A16_NVFP4_CFG`` or ``--qformat w4a16_nvfp4`` in ``hf_ptq.py``. vLLM deployment support is in progress.
- Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow but makes pruning slightly faster and may result in slightly different pruned model because of different kernel and numerics.
- Add end-to-end tutorial for Minitron pruning + distillation + quantization + evaluation + vLLM deployment for Nemotron-Nano-9B-v2 → Pruned 7B along with data blend preparation steps (and ablation study). See `examples/pruning/minitron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning/minitron/>`_ for details.
- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
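As a rough illustration of what "FP4 weights with group_size=16" means in the changelog entry, the following is a minimal pure-Python sketch of block-wise FP4 (E2M1) quantize-dequantize. It is not ModelOpt's implementation: `fake_quantize_block` and `fake_quantize` are hypothetical names, and the block scale is kept as a plain Python float, whereas NVFP4 stores block scales in FP8 (E4M3) together with a per-tensor scale.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # matches group_size=16 from the changelog entry


def fake_quantize_block(block):
    """Quantize-dequantize one block of weights that share a single scale."""
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return [0.0] * len(block), 0.0
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    # Simplification (assumption): the scale stays a Python float here.
    scale = amax / E2M1_GRID[-1]
    out = []
    for w in block:
        # Round the scaled magnitude to the nearest representable grid value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(math.copysign(mag * scale, w))
    return out, scale


def fake_quantize(weights, group_size=GROUP_SIZE):
    """Apply block-wise fake quantization along a flat list of weights."""
    if len(weights) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    result = []
    for start in range(0, len(weights), group_size):
        deq, _ = fake_quantize_block(weights[start:start + group_size])
        result.extend(deq)
    return result
```

Because the scale depends only on the weights themselves, a scheme like this needs no activation statistics, which is consistent with the "no calibration forward pass required" note above.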
9 changes: 8 additions & 1 deletion examples/llm_ptq/hf_ptq.py
@@ -113,6 +113,7 @@ def _set_kv_cache_constant_amax(quant_cfg: list) -> None:
"fp8_pb_wo": mtq.FP8_2D_BLOCKWISE_WEIGHT_ONLY_CFG,
"fp8_pc_pt": mtq.FP8_PER_CHANNEL_PER_TOKEN_CFG,
"w4a8_nvfp4_fp8": mtq.W4A8_NVFP4_FP8_CFG,
"w4a16_nvfp4": mtq.W4A16_NVFP4_CFG,
"w4a8_mxfp4_fp8": mtq.W4A8_MXFP4_FP8_CFG,
"nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG,
"nvfp4_experts_only": mtq.NVFP4_EXPERTS_ONLY_CFG,
@@ -785,6 +786,12 @@ def export_quantized(
extra_state_dict=mtp_state_dict,
)

if args.qformat == "w4a16_nvfp4":
    warnings.warn(
        "TensorRT-LLM and SGLang do not support this format. "
        "vLLM deployment support is in progress."
    )

# Restore default padding and export the tokenizer as well.
if tokenizer is not None:
tokenizer.padding_side = default_padding_side
@@ -1147,7 +1154,7 @@ def _is_layerwise(obj):
quant_cfg = copy.deepcopy(quant_cfg)
force_weight_quantizers_static(quant_cfg["quant_cfg"])

-if args.qformat in QUANT_CFG_CHOICES:
+if quant_cfg:
mono_quantize(
args,
quant_cfg,
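The condition change in the last hunk (from a `--qformat` membership test to `if quant_cfg:`) can be illustrated with a small sketch. Everything here is a stand-in: the gate functions, the `recipe_args` namespace, and the dictionary contents are hypothetical, and the sketch assumes that a `--recipe` run resolves `quant_cfg` from the recipe while `qformat` holds no preset key.

```python
from types import SimpleNamespace

# Stand-in for the preset table in hf_ptq.py (contents are illustrative only).
QUANT_CFG_CHOICES = {"w4a16_nvfp4": {"quant_cfg": ["..."], "algorithm": "max"}}


def gate_on_qformat(args, quant_cfg):
    """Old condition: quantize only when --qformat names a known preset."""
    return args.qformat in QUANT_CFG_CHOICES


def gate_on_config(args, quant_cfg):
    """New condition: quantize whenever a non-empty config was resolved."""
    return bool(quant_cfg)


# Hypothetical preset run: the config is looked up from --qformat.
preset_args = SimpleNamespace(qformat="w4a16_nvfp4", recipe=None)
preset_cfg = QUANT_CFG_CHOICES[preset_args.qformat]

# Hypothetical recipe run: the config comes from a recipe file instead.
recipe_args = SimpleNamespace(qformat=None, recipe="general/ptq/nvfp4_weight_only-kv_fp16")
recipe_cfg = {"quant_cfg": [{"quantizer_name": "*weight_quantizer"}], "algorithm": "max"}

print(gate_on_qformat(preset_args, preset_cfg), gate_on_config(preset_args, preset_cfg))
print(gate_on_qformat(recipe_args, recipe_cfg), gate_on_config(recipe_args, recipe_cfg))
```

Under these assumptions only the second gate lets the recipe run reach `mono_quantize`, which matches the intent stated in commit e8c0602.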
13 changes: 13 additions & 0 deletions modelopt/torch/export/convert_hf_config.py
@@ -57,6 +57,11 @@ def _quant_algo_to_group_config(quant_algo: str, group_size: int | None = None)
    return {
        "weights": {"dynamic": False, "num_bits": 4, "type": "int", "group_size": gs},
    }
elif quant_algo == "W4A16_NVFP4":
    gs = group_size or 16
    return {
        "weights": {"dynamic": False, "num_bits": 4, "type": "float", "group_size": gs},
    }
elif quant_algo in ("NVFP4_AWQ", "W4A8_AWQ"):
    gs = group_size or 128
    return {
@@ -183,6 +188,14 @@ def convert_hf_quant_config_format(input_config: dict[str, Any]) -> dict[str, An
"targets": ["Linear"],
}
new_config["config_groups"] = {"group_0": config_group_details}
elif quant_algo_value == "W4A16_NVFP4":
    # Weight-only FP4
    group_size = original_quantization_details.get("group_size", 16)
    config_group_details = {
        "weights": {"dynamic": False, "num_bits": 4, "type": "float", "group_size": group_size},
        "targets": ["Linear"],
    }
    new_config["config_groups"] = {"group_0": config_group_details}
elif quant_algo_value == "MIXED_PRECISION":
    quantized_layers = original_quantization_details.get("quantized_layers", {})

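The two `W4A16_NVFP4` branches above emit a config group that describes weights only. A minimal standalone sketch of that shape, assuming a hypothetical helper name `nvfp4_weight_only_groups` (the dictionary keys are copied from the diff, the function itself is not part of the PR):

```python
def nvfp4_weight_only_groups(group_size=None):
    """Build a config_groups mapping for weight-only FP4, mirroring the diff above."""
    # `or` treats both None and 0 as "not provided" and falls back to 16.
    gs = group_size or 16
    return {
        "group_0": {
            "weights": {"dynamic": False, "num_bits": 4, "type": "float", "group_size": gs},
            "targets": ["Linear"],
        }
    }


groups = nvfp4_weight_only_groups()
print(groups["group_0"]["weights"]["group_size"])
print(sorted(groups["group_0"]))
```

The group carries only `weights` and `targets`, with no activation entry, which is how the exported config expresses that activations stay unquantized.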
1 change: 1 addition & 0 deletions modelopt/torch/export/model_config.py
@@ -38,6 +38,7 @@
QUANTIZATION_MXFP4 = "mxfp4"
QUANTIZATION_MXFP8 = "mxfp8"
QUANTIZATION_W4A8_MXFP4_FP8 = "w4a8_mxfp4_fp8"
QUANTIZATION_W4A16_NVFP4 = "w4a16_nvfp4"
QUANTIZATION_NVFP4_AWQ = "nvfp4_awq"
QUANTIZATION_FP8_PB_REAL = "fp8_pb_real"
QUANTIZATION_FP8_PB_WO = "fp8_pb_wo"
12 changes: 12 additions & 0 deletions modelopt/torch/export/quant_utils.py
@@ -69,6 +69,7 @@
QUANTIZATION_W4A8_AWQ,
QUANTIZATION_W4A8_MXFP4_FP8,
QUANTIZATION_W4A8_NVFP4_FP8,
QUANTIZATION_W4A16_NVFP4,
)

logger = logging.getLogger(__name__)
@@ -359,6 +360,7 @@ def get_weight_scaling_factor(module: nn.Module, weight_name: str = "weight") ->
QUANTIZATION_NVFP4,
QUANTIZATION_NVFP4_AWQ,
QUANTIZATION_NVFP4_SVDQUANT,
QUANTIZATION_W4A16_NVFP4,
QUANTIZATION_W4A8_NVFP4_FP8,
]:
# Calibrate weight quantizer if amax is not set
@@ -403,6 +405,7 @@ def get_weight_scaling_factor_2(module: nn.Module, weight_name: str = "weight")
QUANTIZATION_NVFP4,
QUANTIZATION_NVFP4_AWQ,
QUANTIZATION_NVFP4_SVDQUANT,
QUANTIZATION_W4A16_NVFP4,
QUANTIZATION_W4A8_NVFP4_FP8,
]:
# Calibrate weight quantizer if amax is not set
@@ -641,6 +644,9 @@ def _get_quantization_from_layer(layer, quantizer_attr_names: QuantizerAttrNames
    return QUANTIZATION_NVFP4_AWQ
if getattr(layer, "fused_with_prequant", False):
    return QUANTIZATION_NVFP4_AWQ
if input_quantizer is None or not input_quantizer.is_enabled:
    if scale_bits == (4, 3):
        return QUANTIZATION_W4A16_NVFP4
assert input_quantizer is not None, (
    f"input_quantizer is None for {quantizer_attr_names}"
)
@@ -808,6 +814,11 @@ def process_layer_quant_config(layer_config_dict):
"quant_algo": "NVFP4",
"group_size": block_size_value,
}
elif v == "w4a16_nvfp4":
    layer_config = {
        "quant_algo": "W4A16_NVFP4",
        "group_size": block_size_value,
    }
elif v == "nvfp4_awq":
    layer_config = {
        "quant_algo": "NVFP4_AWQ",
@@ -985,6 +996,7 @@ def to_quantized_weight(
if quantization in [
QUANTIZATION_NVFP4,
QUANTIZATION_NVFP4_AWQ,
QUANTIZATION_W4A16_NVFP4,
QUANTIZATION_W4A8_NVFP4_FP8,
QUANTIZATION_NVFP4_SVDQUANT,
]:
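The detection rule added to `_get_quantization_from_layer` (a missing or disabled input quantizer combined with `scale_bits == (4, 3)`) can be sketched in isolation. `FakeQuantizer` and `is_weight_only_nvfp4` are hypothetical stand-ins written for this illustration; the real function inspects actual quantizer modules and handles many other formats.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakeQuantizer:
    """Hypothetical stand-in for a quantizer module with an is_enabled flag."""
    is_enabled: bool = True


def is_weight_only_nvfp4(
    input_quantizer: Optional[FakeQuantizer], scale_bits: Tuple[int, int]
) -> bool:
    """Return True when activations are unquantized and block scales are E4M3."""
    activations_unquantized = input_quantizer is None or not input_quantizer.is_enabled
    return activations_unquantized and scale_bits == (4, 3)


print(is_weight_only_nvfp4(None, (4, 3)))
print(is_weight_only_nvfp4(FakeQuantizer(is_enabled=False), (4, 3)))
print(is_weight_only_nvfp4(FakeQuantizer(is_enabled=True), (4, 3)))
```

Placing this check before the `assert input_quantizer is not None` matters: a weight-only layer returns early instead of tripping the assertion.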
3 changes: 3 additions & 0 deletions modelopt/torch/export/unified_export_hf.py
@@ -83,6 +83,7 @@
QUANTIZATION_NVFP4_SVDQUANT,
QUANTIZATION_W4A8_AWQ,
QUANTIZATION_W4A8_NVFP4_FP8,
QUANTIZATION_W4A16_NVFP4,
)
from .model_utils import get_language_model_from_vl, is_multimodal_model
from .moe_utils import _export_fused_experts
@@ -521,6 +522,7 @@ def _export_quantized_weight(
QUANTIZATION_NVFP4_AWQ,
QUANTIZATION_NVFP4_SVDQUANT,
QUANTIZATION_NVFP4,
QUANTIZATION_W4A16_NVFP4,
QUANTIZATION_W4A8_AWQ,
QUANTIZATION_W4A8_NVFP4_FP8,
]:
@@ -550,6 +552,7 @@ def _export_quantized_weight(
QUANTIZATION_NVFP4,
QUANTIZATION_NVFP4_AWQ,
QUANTIZATION_NVFP4_SVDQUANT,
QUANTIZATION_W4A16_NVFP4,
]:
# Transpose weight from (num_experts, input_dim, output_dim) to (num_experts, output_dim, input_dim)
# for NVFP4 quantization functions that expect input_dim as the last dimension for block quantization
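The transpose described in the comment above can be shown on nested lists. This is only a shape illustration in pure Python with a hypothetical helper name; the actual export code operates on tensors.

```python
def transpose_expert_weights(experts):
    """Swap the last two axes of a (num_experts, input_dim, output_dim) nested list."""
    return [[list(column) for column in zip(*expert)] for expert in experts]


# Toy fused-expert weight: 2 experts, input_dim=4, output_dim=3.
num_experts, input_dim, output_dim = 2, 4, 3
experts = [
    [[e * 100 + i * 10 + o for o in range(output_dim)] for i in range(input_dim)]
    for e in range(num_experts)
]

transposed = transpose_expert_weights(experts)
shape = (len(transposed), len(transposed[0]), len(transposed[0][0]))
print(shape)
```

After the transpose, each innermost row holds the `input_dim` values of one output channel, so a block quantizer that groups consecutive elements of the last dimension forms its blocks along `input_dim`.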
2 changes: 2 additions & 0 deletions modelopt/torch/quantization/config.py
@@ -1684,6 +1684,7 @@ def _nvfp4_selective_quant_cfg(
],
"algorithm": "max",
}
W4A16_NVFP4_CFG = _nvfp4_selective_quant_cfg(["*"], weight_only=True)

MXFP4_MLP_WEIGHT_ONLY_CFG = {
"quant_cfg": [
@@ -1740,6 +1741,7 @@ def _nvfp4_selective_quant_cfg(
"NVFP4_FP8_MHA_CONFIG",
"NVFP4_KV_CFG",
"NVFP4_KV_ROTATE_CFG",
"W4A16_NVFP4_CFG",
"W4A8_NVFP4_FP8_CFG",
"NVFP4_SVDQUANT_DEFAULT_CFG",
"W4A8_AWQ_BETA_CFG",
24 changes: 24 additions & 0 deletions modelopt_recipes/configs/ptq/units/w4_nvfp4.yaml
@@ -0,0 +1,24 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# W4A16 NVFP4: NVFP4 E2M1 dynamic weight quantizer only; activations remain in BF16.

# modelopt-schema: modelopt.torch.quantization.config.QuantizerCfgListConfig
imports:
  nvfp4: configs/numerics/nvfp4
---
- quantizer_name: '*weight_quantizer'
  cfg:
    $import: nvfp4
29 changes: 29 additions & 0 deletions modelopt_recipes/general/ptq/nvfp4_weight_only-kv_fp16.yaml
@@ -0,0 +1,29 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

imports:
  base_disable_all: configs/ptq/units/base_disable_all
  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
  w4a16_nvfp4: configs/ptq/units/w4_nvfp4

metadata:
  recipe_type: ptq
  description: NVFP4 W4A16 weight-only, BF16 activations, max calibration. No calibration forward pass required.
quantize:
  algorithm: max
  quant_cfg:
    - $import: base_disable_all
    - $import: w4a16_nvfp4
    - $import: default_disabled_quantizers
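The recipe composes three units in order. A minimal sketch of how such an ordered list could resolve for a given quantizer name follows, assuming that later entries override earlier ones. The contents of `base_disable_all` and `default_disabled_quantizers` are not shown in this diff, so the `*` and `*lm_head*` patterns below are illustrative guesses, and `is_enabled` is a hypothetical helper rather than ModelOpt's resolver.

```python
from fnmatch import fnmatchcase

# (pattern, enabled) pairs in recipe order.
RULES = [
    ("*", False),                 # base_disable_all (assumed content)
    ("*weight_quantizer", True),  # w4_nvfp4 unit added in this PR
    ("*lm_head*", False),         # default_disabled_quantizers (assumed example entry)
]


def is_enabled(quantizer_name, rules=RULES):
    """Return the state set by the last rule whose pattern matches the name."""
    enabled = False
    for pattern, state in rules:
        if fnmatchcase(quantizer_name, pattern):
            enabled = state
    return enabled


for name in (
    "model.layers.0.mlp.up_proj.weight_quantizer",
    "model.layers.0.mlp.up_proj.input_quantizer",
    "lm_head.weight_quantizer",
):
    print(name, is_enabled(name))
```

Under these assumptions, weight quantizers end up enabled, input quantizers stay disabled (activations remain BF16), and layers named by the default-disabled unit are excluded again because that unit is imported last.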
Original file line number Diff line number Diff line change
@@ -47,6 +47,7 @@
("w4a8_awq", "tiny_llama-w4a8-awq", True, False, True, True, False),
("int8_wo", "tiny_llama-int8-wo", False, False, False, False, False),
("nvfp4_svdquant", "tiny_llama-nvfp4-svdquant", True, False, True, True, True),
("w4a16_nvfp4", "tiny_llama-w4a16-nvfp4", False, False, False, False, False),
# MoE models (fused experts: Qwen3 MoE, GPT-OSS)
("nvfp4", "tiny_qwen3_moe-nvfp4", True, False, True, True, False),
("fp8", "tiny_gpt_oss-fp8", True, False, True, True, False),