NVIDIA · kiranbeethoju · Jun 6, 2026
@@ -183,7 +183,7 @@ python hf_ptq.py \
   --export_path <quantized_ckpt_path>
 ```
 
-Built-in recipes are located in `modelopt_recipes/general/ptq/` for model-agnostic recipes and in `modelopt_recipes/huggingface/<model_type>/ptq/` for recipes tuned to a specific Hugging Face `model_type` (see [`modelopt_recipes/huggingface/README.md`](../../modelopt_recipes/huggingface/README.md)). You can also provide a path to your own custom YAML recipe file or directory. See the [recipe documentation](https://nvidia.github.io/Model-Optimizer) for details on the YAML schema and available recipes.
+Built-in recipes are located in `modelopt_recipes/general/ptq/` for model-agnostic recipes and in `modelopt_recipes/huggingface/<model_type>/ptq/` for recipes tuned to a specific Hugging Face `model_type` (see [`modelopt_recipes/huggingface/README.md`](../../modelopt_recipes/huggingface/README.md)). For Llama 3.x NVFP4, start with `huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast`. You can also provide a path to your own custom YAML recipe file or directory. See the [recipe documentation](https://nvidia.github.io/Model-Optimizer) for details on the YAML schema and available recipes.
 
 > *When `--recipe` is specified, `--qformat` and `--kv_cache_qformat` are ignored. The recipe fully defines the quantization configuration.*
 

@@ -0,0 +1,52 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Composed PTQ recipe for MLP/MoE-only dynamic NVFP4 quantization with FP8 KV-cache cast mode.
+
+imports:
+  base_disable_all: configs/ptq/units/base_disable_all
+  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
+  nvfp4: configs/numerics/nvfp4
+  kv_fp8_cast: configs/ptq/units/kv_fp8_cast
+
+metadata:
+  recipe_type: ptq
+  description: >-
+    Applies dynamic NVFP4 only to MLP/MoE weight and input quantizers, plus FP8 KV-cache
+    cast mode (constant amax, no KV calibration); uses max calibration.
+quantize:
+  algorithm: max
+  quant_cfg:
+    - $import: base_disable_all
+    - quantizer_name: '*mlp*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*mlp*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*block_sparse_moe*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*block_sparse_moe*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*.experts.*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*.experts.*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - $import: kv_fp8_cast
+    - $import: default_disabled_quantizers
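
A minimal sketch of how the `quant_cfg` list above might resolve for Llama-style module names. It assumes, without confirmation from this diff, that entries are applied in order with later matches overriding earlier ones, and that `quantizer_name` patterns are shell-style wildcards. The `*` and `*lm_head*` rules are hypothetical stand-ins for `base_disable_all` and `default_disabled_quantizers`, whose contents are not shown here; `kv_fp8_cast` is omitted for the same reason. `resolve` is an illustrative helper, not a ModelOpt API.

```python
from fnmatch import fnmatchcase

# Ordered rules mirroring the recipe's quant_cfg list.
# ASSUMPTION: "*" and "*lm_head*" are hypothetical stand-ins for the imported
# units base_disable_all and default_disabled_quantizers (contents not shown).
RULES = [
    ("*", "disabled"),
    ("*mlp*weight_quantizer", "nvfp4"),
    ("*mlp*input_quantizer", "nvfp4"),
    ("*block_sparse_moe*weight_quantizer", "nvfp4"),
    ("*block_sparse_moe*input_quantizer", "nvfp4"),
    ("*.experts.*weight_quantizer", "nvfp4"),
    ("*.experts.*input_quantizer", "nvfp4"),
    ("*lm_head*", "disabled"),
]


def resolve(quantizer_name, rules=RULES):
    """Return the setting of the last rule whose pattern matches, or None."""
    setting = None
    for pattern, value in rules:
        # ASSUMPTION: case-sensitive shell-style wildcard matching.
        if fnmatchcase(quantizer_name, pattern):
            setting = value
    return setting


if __name__ == "__main__":
    names = [
        "model.layers.0.mlp.gate_proj.weight_quantizer",
        "model.layers.0.mlp.down_proj.input_quantizer",
        "model.layers.0.self_attn.q_proj.weight_quantizer",
        "lm_head.weight_quantizer",
    ]
    for name in names:
        print(f"{name}: {resolve(name)}")
```

Under these assumptions, the MLP quantizers resolve to `nvfp4`, while the attention projection and `lm_head` stay disabled. That matches the recipe description of MLP/MoE-only quantization.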
@@ -14,7 +14,8 @@ ones. When deciding which to use:
    `<specific_model>/` folder if the recipe is tuned for one released
    checkpoint rather than every checkpoint of that `model_type`. The
    presence of a folder here signals that there is a recommended recipe
-   for that `model_type` or model instance.
+   for that `model_type` or model instance. For example, see
+   `huggingface/llama/ptq/` for Llama 3.x NVFP4 recipes.
 2. **Fall back to `general/`** if no `<model_type>/` folder applies. The
    general recipes are a good starting point for any model — and the
    recommended starting point for a model architecture that does not yet

@@ -0,0 +1,29 @@
+# Llama PTQ recipes
+
+Recipes for Hugging Face models with `model_type: llama` (Llama 3.x, such as
+Llama 3.1, Llama 3.2, and Llama 3.3).
+
+## Choosing a recipe
+
+| Recipe | When to use |
+|--------|-------------|
+| `nvfp4_mlp_only-kv_fp8_cast.yaml` | **Recommended starting point** for NVFP4 on Llama when full W4A4 hurts accuracy. Quantizes MLP (and MoE expert) layers only; attention stays unquantized. FP8 KV cache uses cast mode (no extra KV calibration). |
+| `nvfp4_default-kv_fp8_cast.yaml` | Full dynamic NVFP4 W4A4 on all linear layers when you need maximum compression and can accept more accuracy loss. |
+
+General-model equivalents live under `modelopt_recipes/general/ptq/` with the
+same numerics. These `huggingface/llama/ptq/` paths exist so users can pass a
+model-family recipe without guessing:
+
+```bash
+python examples/llm_ptq/hf_ptq.py \
+  --pyt_ckpt_path meta-llama/Llama-3.1-8B-Instruct \
+  --recipe huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast \
+  --export_path ./llama-3.1-8b-nvfp4-mlp-only
+```
+
+For data-driven KV calibration, use the general recipes with `kv_fp8` instead of
+`kv_fp8_cast` (e.g. `general/ptq/nvfp4_mlp_only-kv_fp8`).
+
+NVFP4 inference requires NVIDIA Blackwell GPUs and a compatible runtime (TensorRT-LLM,
+vLLM, or SGLang). Recipe validation and unit tests run on CPU; end-to-end PTQ
+export still requires a CUDA GPU.
@@ -0,0 +1,36 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Llama-family PTQ recipe for full dynamic NVFP4 W4A4 with FP8 KV-cache cast.
+
+imports:
+  base_disable_all: configs/ptq/units/base_disable_all
+  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
+  w4a4_nvfp4_nvfp4: configs/ptq/units/w4a4_nvfp4_nvfp4
+  kv_fp8_cast: configs/ptq/units/kv_fp8_cast
+
+metadata:
+  recipe_type: ptq
+  description: >-
+    Llama PTQ recipe (model_type llama): dynamic NVFP4 W4A4 on all linear layers,
+    plus FP8 KV-cache cast mode. Same numerics as general/ptq/nvfp4_default-kv_fp8_cast.
+    For higher accuracy on smaller Llama models, prefer nvfp4_mlp_only-kv_fp8_cast.
+quantize:
+  algorithm: max
+  quant_cfg:
+    - $import: base_disable_all
+    - $import: w4a4_nvfp4_nvfp4
+    - $import: kv_fp8_cast
+    - $import: default_disabled_quantizers
@@ -0,0 +1,55 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Llama-family PTQ recipe for partial NVFP4 (MLP only) with FP8 KV-cache cast.
+# Recommended starting point for Llama 3.x when the accuracy loss from full W4A4 NVFP4 is too high.
+
+imports:
+  base_disable_all: configs/ptq/units/base_disable_all
+  default_disabled_quantizers: configs/ptq/units/default_disabled_quantizers
+  nvfp4: configs/numerics/nvfp4
+  kv_fp8_cast: configs/ptq/units/kv_fp8_cast
+
+metadata:
+  recipe_type: ptq
+  description: >-
+    Llama PTQ recipe (model_type llama): dynamic NVFP4 on MLP/MoE layers only, with
+    FP8 KV-cache cast mode. Attention linear layers stay unquantized (BF16/FP16).
+    Same numerics as general/ptq/nvfp4_mlp_only-kv_fp8_cast; use this path for
+    Llama 3.x checkpoints via --recipe huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast.
+quantize:
+  algorithm: max
+  quant_cfg:
+    - $import: base_disable_all
+    - quantizer_name: '*mlp*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*mlp*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*block_sparse_moe*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*block_sparse_moe*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*.experts.*weight_quantizer'
+      cfg:
+        $import: nvfp4
+    - quantizer_name: '*.experts.*input_quantizer'
+      cfg:
+        $import: nvfp4
+    - $import: kv_fp8_cast
+    - $import: default_disabled_quantizers
diff --git a/tests/unit/recipe/test_loader.py b/tests/unit/recipe/test_loader.py
@@ -163,7 +163,10 @@ def test_load_recipe_builtin_description():
     "general/ptq/nvfp4_experts_only-kv_fp8",
     "general/ptq/nvfp4_experts_only-kv_fp8_layerwise",
     "general/ptq/nvfp4_mlp_only-kv_fp8",
+    "general/ptq/nvfp4_mlp_only-kv_fp8_cast",
     "general/ptq/nvfp4_omlp_only-kv_fp8",
+    "huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast",
+    "huggingface/llama/ptq/nvfp4_default-kv_fp8_cast",
 ]
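
Both Llama recipes state that they share numerics with their `general/ptq/` counterparts, so the two copies could drift apart silently. One possible guard is sketched here, assuming each recipe has already been parsed into a plain dict. The parsing step and the `numerics_match` helper are hypothetical and not part of the loader under test. The check compares everything except `metadata`:

```python
def numerics_match(recipe_a: dict, recipe_b: dict) -> bool:
    """Return True when two parsed recipes agree on every key except `metadata`."""

    def without_metadata(recipe):
        return {key: value for key, value in recipe.items() if key != "metadata"}

    return without_metadata(recipe_a) == without_metadata(recipe_b)


# Abbreviated, hypothetical stand-ins for the parsed YAML files in this diff.
general = {
    "imports": {"nvfp4": "configs/numerics/nvfp4"},
    "metadata": {"recipe_type": "ptq", "description": "general recipe"},
    "quantize": {"algorithm": "max", "quant_cfg": [{"$import": "base_disable_all"}]},
}
# Same numerics, different description.
llama = {**general, "metadata": {"recipe_type": "ptq", "description": "Llama recipe"}}
# Simulated drift: the quant_cfg list was changed in only one copy.
drifted = {**llama, "quantize": {"algorithm": "max", "quant_cfg": []}}

if __name__ == "__main__":
    print(numerics_match(general, llama))    # True: only metadata differs
    print(numerics_match(general, drifted))  # False: quant_cfg differs
```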