32 changes: 31 additions & 1 deletion docs/features/quantization/modelopt.md
@@ -16,7 +16,9 @@ following `quantization.quant_algo` values:
- `FP8`: per-tensor weight scale (+ optional static activation scale).
- `FP8_PER_CHANNEL_PER_TOKEN`: per-channel weight scale and dynamic per-token activation quantization.
- `FP8_PB_WO` (ModelOpt may emit `fp8_pb_wo`): block-scaled FP8 weight-only (typically 128×128 blocks).
-- `NVFP4`: ModelOpt NVFP4 checkpoints (use `quantization="modelopt_fp4"`).
+- `NVFP4`: ModelOpt W4A4 NVFP4 checkpoints (use `quantization="modelopt_fp4"`).
+- `W4A16_NVFP4`: ModelOpt weight-only NVFP4 checkpoints with fp16/bf16 activations.
+- `MIXED_PRECISION`: per-layer ModelOpt checkpoints that combine the formats above, for example FP8 attention layers with W4A16 NVFP4 MoE experts.
- `MXFP8`: ModelOpt MXFP8 checkpoints (use `quantization="modelopt_mxfp8"`).
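
To check which of these values a given export carries, one option is to read `quantization.quant_algo` from the checkpoint's `hf_quant_config.json`. The following is a minimal sketch, not part of vLLM or ModelOpt: the helper names `load_quant_config` and `describe_quant_config` are hypothetical, and the per-layer `quantized_layers` key is an assumption about how mixed-precision exports list their layers.

```python
import json
from collections import Counter
from pathlib import Path


def load_quant_config(checkpoint_dir: str) -> dict:
    """Parse hf_quant_config.json from an exported checkpoint directory."""
    path = Path(checkpoint_dir) / "hf_quant_config.json"
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def describe_quant_config(config: dict) -> dict:
    """Summarize the top-level and per-layer quant_algo values."""
    quant = config.get("quantization", {})
    algo = quant.get("quant_algo")
    # ModelOpt may emit lowercase names such as `fp8_pb_wo`; normalize them.
    normalized = algo.upper() if isinstance(algo, str) else None
    # ASSUMPTION: mixed-precision exports list per-layer settings under
    # "quantized_layers" as {layer_name: {"quant_algo": ...}}.
    layers = quant.get("quantized_layers", {})
    per_layer = Counter(
        str(entry.get("quant_algo", "")).upper() for entry in layers.values()
    )
    return {"quant_algo": normalized, "per_layer_counts": dict(per_layer)}
```

Calling `describe_quant_config(load_quant_config("<path_to_exported_checkpoint>"))` would then show, for a `MIXED_PRECISION` export, how many layers use each format.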

## Quantizing HuggingFace Models with PTQ
Expand Down Expand Up @@ -102,6 +104,34 @@ vllm serve <path_to_exported_checkpoint> \
--host 0.0.0.0 --port 8000
```

## Serving W4A16 NVFP4 MoE checkpoints with Marlin

Some ModelOpt NVFP4 MoE checkpoints are exported as
`quantization.quant_algo = "MIXED_PRECISION"` and mark MoE expert layers (and
sometimes `lm_head`) as `W4A16_NVFP4` in `hf_quant_config.json`. This is a
weight-only NVFP4 format: weights are stored in 4-bit NVFP4, while activations
remain fp16/bf16. It is served by the Marlin W4A16 path, not by W4A4 kernels
that expect runtime activation quantization.
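
The weight-only idea can be illustrated with a toy, pure-Python sketch. It is a simplification under stated assumptions, not the Marlin kernel: weights are rounded to the FP4 E2M1 magnitudes with one scale per 16-element block, scales are kept as ordinary floats (real NVFP4 stores block scales in FP8 and adds a per-tensor scale), and the function names are hypothetical. The point is only that activations are never quantized on a W4A16 path.

```python
# Magnitudes representable by FP4 E2M1 (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements sharing one scale (simplified NVFP4 block)


def quantize_block(weights):
    """Return (scale, signed E2M1 values) for one block of weights."""
    amax = max(abs(w) for w in weights)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    q = []
    for w in weights:
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(w) / scale - v))
        q.append(mag if w >= 0 else -mag)
    return scale, q


def quantize_row(row):
    """Split a weight row into blocks and quantize each one."""
    return [quantize_block(row[i:i + BLOCK]) for i in range(0, len(row), BLOCK)]


def w4a16_dot(activations, blocks):
    """Weight-only dot product: dequantize weights, leave activations as-is."""
    total = 0.0
    offset = 0
    for scale, q in blocks:
        for v in q:
            total += activations[offset] * (scale * v)
            offset += 1
    return total
```

A W4A4 path would additionally quantize `activations` at runtime before the multiply, which is the step this format skips.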

For reproducible debugging and benchmarking of W4A16 NVFP4 checkpoints on
CUDA GPUs where Marlin FP4 is available, you can explicitly pin the Marlin
linear and MoE backends:

```bash
vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
    --quantization modelopt \
    --linear-backend marlin \
    --moe-backend marlin \
    --kv-cache-dtype fp8_e4m3 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --host 0.0.0.0 --port 8000
```

Review comment on `--linear-backend marlin \`:

P2: Avoid globally pinning Marlin for mixed FP8 layers

In the `MIXED_PRECISION` case described above, where some non-MoE `LinearBase` layers are FP8, this global flag also forces those FP8 layers through `MarlinFP8ScaledMMLinearKernel`: `choose_scaled_mm_linear_kernel` filters by `_get_linear_backend()`, and that Marlin FP8 kernel rejects compute capability >= 89 unless `VLLM_TEST_FORCE_FP8_MARLIN` is set. The documented SM120 command can therefore fail at startup before reaching the W4A16 MoE path. Prefer pinning only `--moe-backend marlin` for these checkpoints, or document the extra env/caveat when FP8 linear layers are present.

When debugging startup, check the logs for the Marlin NVFP4 linear and MoE
backend selections. Also run a short generation sanity check before comparing
latency or throughput.
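
A short generation sanity check could look like the sketch below. It assumes the server was started with the command above and is reachable on `localhost:8000` through vLLM's OpenAI-compatible `/v1/completions` route; the helper names and the prompt are hypothetical, and the check only confirms that a non-empty completion comes back, not that the output is correct.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=16):
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding keeps the check repeatable
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def sanity_check(base_url, model, prompt="The capital of France is"):
    """Send one short prompt and fail loudly on an empty completion."""
    req = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    text = body["choices"][0]["text"]
    if not text.strip():
        raise RuntimeError("server returned an empty completion")
    return text
```

For the command above, `sanity_check("http://localhost:8000", "nvidia/Qwen3.6-35B-A3B-NVFP4")` would be run once the server reports it is ready; reading the returned text by eye is still worthwhile before any benchmarking.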

## Testing (local checkpoints)

vLLM's ModelOpt unit tests are gated by local checkpoint paths and are skipped
8 changes: 4 additions & 4 deletions vllm/model_executor/kernels/linear/nvfp4/marlin.py
@@ -32,10 +32,10 @@ def can_implement(cls, config: NvFp4LinearLayerConfig) -> tuple[bool, str | None

     def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
         logger.warning_once(
-            "Your GPU does not have native support for FP4 computation but "
-            "FP4 quantization is being used. Weight-only FP4 compression "
-            "will be used leveraging the Marlin kernel. This may degrade "
-            "performance for compute-heavy workloads."
+            "Using Marlin for NVFP4 weight-only GEMM (W4A16). Activations "
+            "remain fp16/bf16 on this path; W4A4 NVFP4 checkpoints that "
+            "quantize activations should use a native NVFP4 backend when "
+            "available."
         )
         prepare_fp4_layer_for_marlin(layer)
