2 changes: 1 addition & 1 deletion examples/llm_ptq/README.md
@@ -183,7 +183,7 @@ python hf_ptq.py \
--export_path <quantized_ckpt_path>
```

Built-in recipes are located in `modelopt_recipes/general/ptq/` for model-agnostic recipes and in `modelopt_recipes/huggingface/<model_type>/ptq/` for recipes tuned to a specific Hugging Face `model_type` (see [`modelopt_recipes/huggingface/README.md`](../../modelopt_recipes/huggingface/README.md)). You can also provide a path to your own custom YAML recipe file or directory. See the [recipe documentation](https://nvidia.github.io/Model-Optimizer) for details on the YAML schema and available recipes.
Built-in recipes are located in `modelopt_recipes/general/ptq/` for model-agnostic recipes and in `modelopt_recipes/huggingface/<model_type>/ptq/` for recipes tuned to a specific Hugging Face `model_type` (see [`modelopt_recipes/huggingface/README.md`](../../modelopt_recipes/huggingface/README.md)). For Llama 3.x NVFP4, start with `huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast`. You can also provide a path to your own custom YAML recipe file or directory. See the [recipe documentation](https://nvidia.github.io/Model-Optimizer) for details on the YAML schema and available recipes.

> *When `--recipe` is specified, `--qformat` and `--kv_cache_qformat` are ignored. The recipe fully defines the quantization configuration.*
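
The lookup order described above (a model-specific folder first, the general folder otherwise) can be sketched in a few lines. This is only an illustration: `pick_recipe_dir` is a hypothetical helper, not part of Model-Optimizer, and the demo builds a throwaway directory tree rather than reading the real `modelopt_recipes/` package.

```python
from pathlib import Path
import tempfile


def pick_recipe_dir(recipes_root: Path, model_type: str) -> Path:
    """Hypothetical helper: prefer huggingface/<model_type>/ptq, else general/ptq."""
    specific = recipes_root / "huggingface" / model_type / "ptq"
    return specific if specific.is_dir() else recipes_root / "general" / "ptq"


# Demo on a temporary tree that mimics the folder layout named in the README.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "general" / "ptq").mkdir(parents=True)
    (root / "huggingface" / "llama" / "ptq").mkdir(parents=True)
    llama_dir = pick_recipe_dir(root, "llama").relative_to(root).as_posix()
    other_dir = pick_recipe_dir(root, "some_new_arch").relative_to(root).as_posix()

print(llama_dir)
print(other_dir)
```

A `model_type` with its own folder resolves to that folder; anything else falls back to the general recipes.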

52 changes: 52 additions & 0 deletions modelopt_recipes/general/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml
@@ -0,0 +1,52 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Composed PTQ recipe for MLP/MoE-only dynamic NVFP4 quantization with FP8 KV-cache cast mode.

imports:
  base_disable_all: configs/ptq/units/base_disable_all
  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
  nvfp4: configs/numerics/nvfp4
  kv_fp8_cast: configs/ptq/units/kv_fp8_cast

metadata:
  recipe_type: ptq
  description: >-
    Applies dynamic NVFP4 only to MLP/MoE weight and input quantizers, plus FP8 KV-cache
    cast mode (constant amax, no KV calibration); uses max calibration.
quantize:
  algorithm: max
  quant_cfg:
    - $import: base_disable_all
    - quantizer_name: '*mlp*weight_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*mlp*input_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*block_sparse_moe*weight_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*block_sparse_moe*input_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*.experts.*weight_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*.experts.*input_quantizer'
      cfg:
        $import: nvfp4
    - $import: kv_fp8_cast
    - $import: default_disabled_quantizers
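
To see which quantizers the six `quantizer_name` patterns in this recipe would select, here is a minimal sketch. It assumes the patterns are shell-style globs matched against the full quantizer name, which this diff does not confirm. The module names are hypothetical examples in the usual Hugging Face layout, and the `base_disable_all`, `kv_fp8_cast`, and `default_disabled_quantizers` imports are not modelled.

```python
from fnmatch import fnmatchcase

# Patterns copied from the recipe's quant_cfg entries that import nvfp4.
ENABLE_PATTERNS = [
    "*mlp*weight_quantizer",
    "*mlp*input_quantizer",
    "*block_sparse_moe*weight_quantizer",
    "*block_sparse_moe*input_quantizer",
    "*.experts.*weight_quantizer",
    "*.experts.*input_quantizer",
]


def is_nvfp4_target(quantizer_name: str) -> bool:
    """Return True if any enable pattern matches (assumed glob semantics)."""
    return any(fnmatchcase(quantizer_name, p) for p in ENABLE_PATTERNS)


# Hypothetical quantizer names for illustration only.
names = [
    "model.layers.0.mlp.gate_proj.weight_quantizer",
    "model.layers.0.mlp.down_proj.input_quantizer",
    "model.layers.0.self_attn.q_proj.weight_quantizer",
    "model.layers.0.block_sparse_moe.experts.3.w1.weight_quantizer",
    "model.layers.0.mlp.down_proj.output_quantizer",
]
matched = [n for n in names if is_nvfp4_target(n)]

for n in matched:
    print(n)
```

Under these assumptions the attention projection and the MLP output quantizer are left out, which matches the recipe's stated intent of quantizing only MLP/MoE weights and inputs.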
3 changes: 2 additions & 1 deletion modelopt_recipes/huggingface/README.md
@@ -14,7 +14,8 @@ ones. When deciding which to use:
`<specific_model>/` folder if the recipe is tuned for one released
checkpoint rather than every checkpoint of that `model_type`. The
presence of a folder here signals that there is a recommended recipe
for that `model_type` or model instance.
for that `model_type` or model instance. For example, see
`huggingface/llama/ptq/` for Llama 3.x NVFP4 recipes.
2. **Fall back to `general/`** if no `<model_type>/` folder applies. The
general recipes are a good starting point for any model — and the
recommended starting point for a model architecture that does not yet
29 changes: 29 additions & 0 deletions modelopt_recipes/huggingface/llama/ptq/README.md
@@ -0,0 +1,29 @@
# Llama PTQ recipes

Recipes for Hugging Face models with `model_type: llama` (the Llama 3.x family:
Llama 3.1, Llama 3.2, Llama 3.3, etc.).

## Choosing a recipe

| Recipe | When to use |
|--------|-------------|
| `nvfp4_mlp_only-kv_fp8_cast.yaml` | **Recommended starting point** for NVFP4 on Llama when full W4A4 hurts accuracy. Quantizes MLP (and MoE expert) layers only; attention stays unquantized. FP8 KV cache uses cast mode (no extra KV calibration). |
| `nvfp4_default-kv_fp8_cast.yaml` | Full dynamic NVFP4 W4A4 on all linear layers when you need maximum compression and can accept more accuracy loss. |

The general-model equivalents under `modelopt_recipes/general/ptq/` use the
same numerics. The `huggingface/llama/ptq/` paths exist so that users can select
a recipe by model family instead of guessing which general recipe applies:

```bash
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path meta-llama/Llama-3.1-8B-Instruct \
--recipe huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast \
--export_path ./llama-3.1-8b-nvfp4-mlp-only
```

For data-driven KV calibration, use the general recipes with `kv_fp8` instead of
`kv_fp8_cast` (e.g. `general/ptq/nvfp4_mlp_only-kv_fp8`).

NVFP4 inference requires NVIDIA Blackwell GPUs and a compatible runtime (TensorRT-LLM,
vLLM, or SGLang). Recipe validation and unit tests run on CPU; end-to-end PTQ
export still requires a CUDA GPU.
@@ -0,0 +1,36 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Llama-family PTQ recipe for full dynamic NVFP4 W4A4 with FP8 KV-cache cast.

imports:
  base_disable_all: configs/ptq/units/base_disable_all
  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
  w4a4_nvfp4_nvfp4: configs/ptq/units/w4a4_nvfp4_nvfp4
  kv_fp8_cast: configs/ptq/units/kv_fp8_cast

metadata:
  recipe_type: ptq
  description: >-
    Llama PTQ recipe (model_type llama): dynamic NVFP4 W4A4 on all linear layers,
    plus FP8 KV-cache cast mode. Same numerics as general/ptq/nvfp4_default-kv_fp8_cast.
    For higher accuracy on smaller Llama models, prefer nvfp4_mlp_only-kv_fp8_cast.
quantize:
  algorithm: max
  quant_cfg:
    - $import: base_disable_all
    - $import: w4a4_nvfp4_nvfp4
    - $import: kv_fp8_cast
    - $import: default_disabled_quantizers
@@ -0,0 +1,55 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Llama-family PTQ recipe for partial NVFP4 (MLP only) with FP8 KV-cache cast.
# Recommended starting point for Llama 3.x when full W4A4 NVFP4 accuracy loss is high.

imports:
  base_disable_all: configs/ptq/units/base_disable_all
  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
  nvfp4: configs/numerics/nvfp4
  kv_fp8_cast: configs/ptq/units/kv_fp8_cast

metadata:
  recipe_type: ptq
  description: >-
    Llama PTQ recipe (model_type llama): dynamic NVFP4 on MLP/MoE layers only,
    attention QKV projections stay in BF16/FP16, plus FP8 KV-cache cast mode.
    Same numerics as general/ptq/nvfp4_mlp_only-kv_fp8_cast; use this path for
    Llama 3.x checkpoints via --recipe huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast.
quantize:
  algorithm: max
  quant_cfg:
    - $import: base_disable_all
    - quantizer_name: '*mlp*weight_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*mlp*input_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*block_sparse_moe*weight_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*block_sparse_moe*input_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*.experts.*weight_quantizer'
      cfg:
        $import: nvfp4
    - quantizer_name: '*.experts.*input_quantizer'
      cfg:
        $import: nvfp4
    - $import: kv_fp8_cast
    - $import: default_disabled_quantizers
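
The description claims this recipe has the same numerics as `general/ptq/nvfp4_mlp_only-kv_fp8_cast`. One way to guard that claim is to compare everything except `metadata`. The sketch below is an assumption about how such a check could look, not an existing test in this PR: `numerics_of` and `same_numerics` are hypothetical helpers, and the two dictionaries are abbreviated stand-ins for the parsed YAML files (in practice they could come from `yaml.safe_load`).

```python
def numerics_of(recipe: dict) -> dict:
    """Keep only the fields that affect quantization; drop descriptive metadata."""
    return {
        "imports": recipe.get("imports", {}),
        "quantize": recipe.get("quantize", {}),
    }


def same_numerics(a: dict, b: dict) -> bool:
    """Hypothetical check: two recipes agree on imports and the ordered quant_cfg."""
    return numerics_of(a) == numerics_of(b)


# Abbreviated stand-ins for the two parsed recipes (only two quant_cfg entries shown).
general_recipe = {
    "imports": {"nvfp4": "configs/numerics/nvfp4"},
    "metadata": {"recipe_type": "ptq", "description": "general variant"},
    "quantize": {
        "algorithm": "max",
        "quant_cfg": [
            {"quantizer_name": "*mlp*weight_quantizer", "cfg": {"$import": "nvfp4"}},
            {"quantizer_name": "*mlp*input_quantizer", "cfg": {"$import": "nvfp4"}},
        ],
    },
}
llama_recipe = {
    **general_recipe,
    "metadata": {"recipe_type": "ptq", "description": "llama variant"},
}

print(same_numerics(general_recipe, llama_recipe))
```

Because `quant_cfg` is compared as a list, a reordering of entries also counts as a difference, which is the conservative choice if entry order affects the result.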
3 changes: 3 additions & 0 deletions tests/unit/recipe/test_loader.py
@@ -163,7 +163,10 @@ def test_load_recipe_builtin_description():
"general/ptq/nvfp4_experts_only-kv_fp8",
"general/ptq/nvfp4_experts_only-kv_fp8_layerwise",
"general/ptq/nvfp4_mlp_only-kv_fp8",
"general/ptq/nvfp4_mlp_only-kv_fp8_cast",
"general/ptq/nvfp4_omlp_only-kv_fp8",
"huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast",
"huggingface/llama/ptq/nvfp4_default-kv_fp8_cast",
]

