98 commits
b313689
fix issues with nvfp4 dense emulation in vllm (squash)
fxmarty-amd Mar 2, 2026
bc6ff39
address comments
fxmarty-amd Mar 2, 2026
14bc668
nvfp4 moe emulation support
fxmarty-amd Mar 2, 2026
a11d131
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Mar 2, 2026
95c6a4a
wip use TritonExperts
fxmarty-amd Mar 2, 2026
5a2cf8c
wip cleanup
fxmarty-amd Mar 2, 2026
0ea8f82
wip cleanup
fxmarty-amd Mar 2, 2026
d99373e
wip cleanup
fxmarty-amd Mar 2, 2026
7a5f2ba
fix activation quantization
fxmarty-amd Mar 2, 2026
457f9df
address comment
fxmarty-amd Mar 2, 2026
86d6316
aot weight dequantization
fxmarty-amd Mar 3, 2026
2cb040b
use emulation_dequantize_weights for quark OCP MX as well
fxmarty-amd Mar 3, 2026
7a67180
tiny fix
fxmarty-amd Mar 3, 2026
01b4dce
enable test on non-blackwell devices
fxmarty-amd Mar 3, 2026
aef916d
Merge branch 'upstream-nvfp4-simulation-support-moe' into upstream-nv…
fxmarty-amd Mar 3, 2026
c4aff81
add test
fxmarty-amd Mar 3, 2026
4710a00
add test
fxmarty-amd Mar 3, 2026
affdda7
support quark dense and moe nvfp4
fxmarty-amd Mar 3, 2026
da111bd
wip cleanup
fxmarty-amd Mar 3, 2026
0cc4207
bug fixes and add test
fxmarty-amd Mar 3, 2026
1d6c770
Merge branch 'main' into upstream-nvfp4-simulation-support-moe
fxmarty-amd Mar 4, 2026
cf189ef
cleanup
fxmarty-amd Mar 4, 2026
b83ea66
aot weight dequantization
fxmarty-amd Mar 4, 2026
b74afa8
use emulation_dequantize_weights for quark OCP MX as well
fxmarty-amd Mar 3, 2026
913824f
tiny fix
fxmarty-amd Mar 3, 2026
c473004
add test
fxmarty-amd Mar 3, 2026
43345ed
add test
fxmarty-amd Mar 3, 2026
6cc2a0d
Merge branch 'upstream-nvfp4-simulation-support-moe' into upstream-nv…
fxmarty-amd Mar 4, 2026
a8c7ee8
fix
fxmarty-amd Mar 4, 2026
ca2c2b8
Merge branch 'upstream-nvfp4-simulation-aot-weight-dequantization' in…
fxmarty-amd Mar 4, 2026
dbc5fb5
fix moe_mk.apply
fxmarty-amd Mar 4, 2026
6db0c7b
Merge branch 'main-upstream' into upstream-nvfp4-simulation-support-rocm
fxmarty-amd Mar 4, 2026
ec1f4b8
address comment
fxmarty-amd Mar 4, 2026
309cefb
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Mar 4, 2026
cca5040
fix
fxmarty-amd Mar 4, 2026
6f08a2d
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Mar 4, 2026
c7cfa6b
Merge branch 'upstream-nvfp4-simulation-aot-weight-dequantization' in…
fxmarty-amd Mar 4, 2026
0094cb9
important note about parallel layers
fxmarty-amd Mar 4, 2026
d440f75
fix wrong inversion
fxmarty-amd Mar 4, 2026
80b6b6c
Merge branch 'upstream-nvfp4-simulation-aot-weight-dequantization' in…
fxmarty-amd Mar 4, 2026
797b856
remove weight scale inversion
fxmarty-amd Mar 4, 2026
dc16065
use min for a13_scale
fxmarty-amd Mar 4, 2026
9007357
Merge branch 'main' into upstream-nvfp4-simulation-support-rocm
fxmarty-amd Mar 5, 2026
e7d72f5
address bowen's comments
fxmarty-amd Mar 6, 2026
e3a8ebd
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Mar 6, 2026
311d47d
linting
fxmarty-amd Mar 6, 2026
74e6eec
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Mar 6, 2026
bf46483
use a single global scale for a2 in MOE, following flashinfer default…
fxmarty-amd Mar 6, 2026
0b47522
do not modify test_blackwell_moe
fxmarty-amd Mar 6, 2026
4a5c5c1
fix test and typo
fxmarty-amd Mar 6, 2026
6ed0611
fix typo
fxmarty-amd Mar 6, 2026
80a37f6
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Mar 6, 2026
35c88a8
simplify test
fxmarty-amd Mar 6, 2026
d495ef7
Merge branch 'upstream-nvfp4-simulation-support-moe' into upstream-nv…
fxmarty-amd Mar 6, 2026
d439e80
remove outdated comment
fxmarty-amd Mar 6, 2026
de79775
Merge branch 'upstream-nvfp4-simulation-support-moe' into upstream-nv…
fxmarty-amd Mar 6, 2026
58b90f1
Merge branch 'upstream-nvfp4-simulation-aot-weight-dequantization' in…
fxmarty-amd Mar 6, 2026
a5da270
revert min change
fxmarty-amd Mar 6, 2026
2d9e65c
Merge branch 'main' into upstream-nvfp4-simulation-support-rocm
fxmarty-amd Mar 24, 2026
c6791f7
address Michael's comments
fxmarty-amd Mar 26, 2026
1fa136e
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Mar 30, 2026
56dd2bf
Merge branch 'main' into upstream-nvfp4-simulation-support-rocm
fxmarty-amd Apr 1, 2026
ad93d2a
linting
fxmarty-amd Apr 1, 2026
0d788d8
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Apr 1, 2026
e8a596f
Update vllm/model_executor/layers/quantization/compressed_tensors/sch…
fxmarty-amd Apr 1, 2026
c6adfe8
Update vllm/model_executor/layers/quantization/compressed_tensors/sch…
fxmarty-amd Apr 1, 2026
e36296a
move unsupported reasons warning in is_backend_supported
fxmarty-amd Apr 1, 2026
33f118f
Merge branch 'upstream-nvfp4-simulation-support-rocm' of https://gith…
fxmarty-amd Apr 1, 2026
44aadca
fix input
fxmarty-amd Apr 1, 2026
3f36269
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Apr 2, 2026
911b316
addres Michael's comments
fxmarty-amd Apr 2, 2026
90a54e3
simulation -> emulation
fxmarty-amd Apr 2, 2026
74b9212
linting
fxmarty-amd Apr 2, 2026
d930b84
Merge branch 'main' into upstream-nvfp4-simulation-support-rocm
fxmarty-amd Apr 2, 2026
24ec4ce
pre-commit passes locally and should not take 50min
fxmarty-amd Apr 2, 2026
58439aa
Merge branch 'upstream-nvfp4-simulation-support-rocm' into upstream-n…
fxmarty-amd Apr 3, 2026
34fba54
Merge branch 'upstream-nvfp4-simulation-support-moe' into upstream-nv…
fxmarty-amd Apr 3, 2026
70b2d5d
remove unnecessary changes
fxmarty-amd Apr 3, 2026
0b6b325
fix
fxmarty-amd Apr 3, 2026
0b2de40
fix
fxmarty-amd Apr 3, 2026
8e61be3
Merge branch 'main' into upstream-nvfp4-simulation-support-moe
fxmarty-amd Apr 8, 2026
f2204ce
refactor OCP MX MOE emulation and address comment about moe_kernel_qu…
fxmarty-amd Apr 8, 2026
ca07f68
move to experts subfolder
fxmarty-amd Apr 8, 2026
223c275
simplifications
fxmarty-amd Apr 9, 2026
d8e9283
linting
fxmarty-amd Apr 9, 2026
1e1d139
Merge branch 'main' into upstream-nvfp4-simulation-support-moe
fxmarty-amd Apr 9, 2026
757c1bc
Merge branch 'upstream-nvfp4-simulation-support-moe' into upstream-nv…
fxmarty-amd Apr 9, 2026
68257c5
remove unnecessary changes
fxmarty-amd Apr 9, 2026
2b74e98
fix issues
fxmarty-amd Apr 9, 2026
896ef66
linting
fxmarty-amd Apr 9, 2026
3663f59
fix quant_dtype
fxmarty-amd Apr 9, 2026
28ef57d
outdated comment
fxmarty-amd Apr 9, 2026
adfb9da
precise comment about maybe_roundup_sizes
fxmarty-amd Apr 13, 2026
9513361
add Qwen3-30B-A3B-NVFP4, Qwen3.5-35B-A3B-MXFP4-TP2 to gfx942 tests
fxmarty-amd Apr 13, 2026
c06e387
Merge branch 'main' into upstream-nvfp4-simulation-support-moe
fxmarty-amd Apr 13, 2026
df32bf3
Merge branch 'upstream-nvfp4-simulation-support-moe' of https://githu…
fxmarty-amd Apr 13, 2026
58c499b
Merge branch 'upstream-nvfp4-simulation-support-moe' into upstream-nv…
fxmarty-amd Apr 13, 2026
dee3b31
update to use the kernel abstraction
fxmarty-amd Apr 13, 2026
2 changes: 2 additions & 0 deletions tests/evals/gsm8k/configs/models-mi3xx.txt
@@ -2,3 +2,5 @@ DeepSeek-R1-TP_MI325.yaml
DeepSeek-R1-DP_MI325.yaml
DeepSeek-V3.2-TP_MI325.yaml
DeepSeek-V3.2-DP_MI325.yaml
Qwen3-30B-A3B-NVFP4.yaml
Qwen3.5-35B-A3B-MXFP4-TP2.yaml
17 changes: 17 additions & 0 deletions tests/models/quantization/test_nvfp4.py
@@ -120,3 +120,20 @@ def test_nvfp4(vllm_runner, model, eager, backend, monkeypatch):
with vllm_runner(model, enforce_eager=eager) as llm:
output = llm.generate_greedy(["1 2 3 4 5"], max_tokens=2)
assert output[0][1] == "1 2 3 4 5 6"


# Qwen3-30B-A3B is 60 GB, versus 210 GB for Llama-4-Scout-17B-16E-Instruct-FP4.
@pytest.mark.parametrize(
"model",
[
"nvidia/Qwen3-30B-A3B-NVFP4",
"RedHatAI/Qwen3-30B-A3B-NVFP4",
],
)
@pytest.mark.parametrize("eager", EAGER)
@pytest.mark.parametrize("backend", ["emulation"])
def test_nvfp4_moe(vllm_runner, model, eager, backend, monkeypatch):
monkeypatch.setenv("VLLM_NVFP4_GEMM_BACKEND", backend)
with vllm_runner(model, enforce_eager=eager, moe_backend="emulation") as llm:
output = llm.generate_greedy(["1 2 3 4 5"], max_tokens=2)
assert output[0][1] == "1 2 3 4 5 6"
59 changes: 57 additions & 2 deletions tests/quantization/test_quark.py
@@ -228,6 +228,8 @@ def get_model_args(
model_name="fxmarty/qwen1.5_moe_a2.7b_chat_w_fp6_e3m2_a_fp6_e3m2",
excepted_value=10.6,
),
# This one raises `RuntimeError: wrong! device_gemm with the specified compilation
# parameters does not support this GEMM problem` on MI355X.
AccuracyTestConfig(
model_name="fxmarty/qwen_1.5-moe-a2.7b-mxfp4", excepted_value=12.4
),
@@ -238,8 +240,13 @@
not QUARK_MXFP4_AVAILABLE,
reason=f"amd-quark>={QUARK_MXFP4_MIN_VERSION} is not available",
)
@pytest.mark.parametrize("config", WIKITEXT_ACCURACY_CONFIGS)
@pytest.mark.parametrize("tp_size", [1, 2])
@pytest.mark.parametrize(
"config",
[pytest.param(val, id=f"config:{val}") for val in WIKITEXT_ACCURACY_CONFIGS],
)
@pytest.mark.parametrize(
"tp_size", [pytest.param(val, id=f"tp_size:{val}") for val in [1, 2]]
)
def test_ocp_mx_wikitext_correctness(config: AccuracyTestConfig, tp_size: int):
device_count = torch.accelerator.device_count()
if device_count < tp_size:
@@ -266,6 +273,54 @@ def test_ocp_mx_wikitext_correctness(config: AccuracyTestConfig, tp_size: int):
), f"Expected: {EXPECTED_VALUE} | Measured: {measured_value}"


@pytest.mark.skipif(
not QUARK_MXFP4_AVAILABLE,
reason=f"amd-quark>={QUARK_MXFP4_MIN_VERSION} is not available",
)
@pytest.mark.parametrize("tp_size", [1, 2])
def test_nvfp4_wikitext_correctness(tp_size: int):
device_count = torch.accelerator.device_count()
if device_count < tp_size:
pytest.skip(f"This test requires >={tp_size} gpus, got only {device_count}")

# NOTE: expected_value from nvidia/Qwen3-30B-A3B-NVFP4
expected_value = 11.2391

model_name = "amd-quark/Qwen3-30B-A3B-nvfp4-quark"
task = "wikitext"

rtol = 0.25

config = AccuracyTestConfig(
model_name=model_name,
excepted_value=expected_value,
)

model_args = config.get_model_args(
tp_size=tp_size,
kwargs={
# Smaller cudagraph_capture_sizes to speed up the test.
"cudagraph_capture_sizes": [16],
},
)
model_args.pop("add_bos_token")

results = lm_eval.simple_evaluate(
model="vllm",
model_args=model_args,
tasks=task,
batch_size=64,
)

EXPECTED_VALUE = config.excepted_value
measured_value = results["results"][task]["word_perplexity,none"]
assert (
measured_value < EXPECTED_VALUE + rtol
and measured_value > EXPECTED_VALUE - rtol
), f"Expected: {EXPECTED_VALUE} | Measured: {measured_value}"


@pytest.mark.parametrize("config", GSM8K_ACCURACY_CONFIGS)
@pytest.mark.skipif(
not QUARK_MXFP4_AVAILABLE,
6 changes: 5 additions & 1 deletion vllm/config/kernel.py
@@ -115,6 +115,7 @@ def with_default(
"flashinfer_cutedsl",
"marlin",
"aiter",
"emulation",
]


@@ -142,7 +143,10 @@ class KernelConfig:
- "flashinfer_cutlass": Use FlashInfer with CUTLASS kernels
- "flashinfer_cutedsl": Use FlashInfer with CuteDSL kernels (FP4 only)
- "marlin": Use Marlin kernels (weight-only quantization)
- "aiter": Use AMD AITer kernels (ROCm only)"""
- "aiter": Use AMD AITer kernels (ROCm only)
- "emulation": Use BF16/FP16 GEMM, dequantizing weights and running
quantize-dequantize (QDQ) on activations.
"""

@field_validator("moe_backend", mode="before")
@classmethod
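The "emulation" backend added to the `KernelConfig` docstring above relies on fake quantize-dequantize (QDQ) of activations. As a rough illustration only (not vLLM's actual implementation, which operates on torch tensors), NVFP4 QDQ with one scale per block of 16 elements can be sketched in NumPy; the real path additionally quantizes block scales to FP8 E4M3 and applies a global scale, which this sketch omits:

```python
import numpy as np

# Non-negative values representable by FP4 E2M1, the NVFP4 element format.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def nvfp4_fake_quant(x: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Quantize-dequantize x to NVFP4 with one scale per block of 16 values."""
    orig_shape = x.shape
    x = x.reshape(-1, block_size)
    # Map each block's absolute maximum onto the largest FP4 value, 6.0.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 6.0
    scale = np.where(scale == 0.0, 1.0, scale)
    scaled = x / scale
    # Round every element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_VALUES).argmin(axis=-1)
    q = FP4_VALUES[idx] * np.sign(scaled)
    # Dequantize back: the result stays in the original floating dtype.
    return (q * scale).reshape(orig_shape)
```

QDQ is idempotent (re-quantizing an already quantized tensor is a no-op), which makes it a cheap way to reproduce quantization error on hardware that has no native NVFP4 kernels.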
164 changes: 164 additions & 0 deletions vllm/model_executor/layers/fused_moe/experts/nvfp4_emulation_moe.py
@@ -0,0 +1,164 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
NVFP4 quantization emulation for MoE.

This file implements NVFP4 MoE emulation for hardware that does not natively
support NVFP4.

Weights are dequantized on the fly during each forward pass, the computation
falls back to `TritonExperts` in BF16, and a fake NVFP4 quantize-dequantize
(QDQ) is applied to the activations `a13` and `a2`.
"""

import torch

import vllm.model_executor.layers.fused_moe.modular_kernel as mk
from vllm.logger import init_logger
from vllm.model_executor.layers.fused_moe.activation import MoEActivation
from vllm.model_executor.layers.fused_moe.config import (
FusedMoEConfig,
FusedMoEQuantConfig,
)
from vllm.model_executor.layers.fused_moe.fused_moe import TritonExperts
from vllm.model_executor.layers.fused_moe.utils import moe_kernel_quantize_input
from vllm.model_executor.layers.quantization.utils.nvfp4_emulation_utils import (
dequantize_to_dtype,
)
from vllm.model_executor.layers.quantization.utils.quant_utils import (
QuantKey,
kNvfp4Dynamic,
kNvfp4Static,
)

logger = init_logger(__name__)


class Nvfp4QuantizationEmulationTritonExperts(TritonExperts):
"""
Extension of TritonExperts to support emulated NVFP4 MoE experts.

It may be used for NVFP4 models when the device does not have
native support for this dtype.
"""

def __init__(
self,
moe_config: FusedMoEConfig,
quant_config: FusedMoEQuantConfig,
):
super().__init__(moe_config, quant_config)
logger.warning_once(
"Using Nvfp4QuantizationEmulationTritonExperts MOE backend. This will"
" dequantize weights on the fly and may be slower than native"
" quantized MOE. Consider using a device with native quantization"
" support (e.g. Nvidia Blackwell) for better performance."
)

# `TritonExperts.apply` expects pre-dequantized weights,
# which we handle in `apply` below.
self.w1_scale_val = self.quant_config.w1_scale
self.w2_scale_val = self.quant_config.w2_scale

self.quant_config._w1.scale = None
self.quant_config._w2.scale = None

self.quantization_emulation = True

@property
def quant_dtype(self) -> torch.dtype | str | None:
return "nvfp4"

@property
def expects_unquantized_inputs(self) -> bool:
return True

@staticmethod
def _supports_quant_scheme(
weight_key: QuantKey | None,
activation_key: QuantKey | None,
) -> bool:
return (weight_key, activation_key) == (kNvfp4Static, kNvfp4Dynamic)

def apply(
self,
output: torch.Tensor,
hidden_states: torch.Tensor,
w1: torch.Tensor,
w2: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor,
activation: MoEActivation,
global_num_experts: int,
expert_map: torch.Tensor | None,
a1q_scale: torch.Tensor | None,
a2_scale: torch.Tensor | None,
workspace13: torch.Tensor,
workspace2: torch.Tensor,
expert_tokens_meta: mk.ExpertTokensMetadata | None,
apply_router_weight_on_input: bool,
):
"""
Apply emulated quantized MoE computation.

This dequantizes the weights on the fly, applies fake NVFP4 quantization
to the input activations, and delegates to `TritonExperts.apply`.
"""
# Dequantize weights if they are quantized
# For NVFP4, weights are packed in uint8 format
# w1 shape: [num_experts, 2*intermediate_size, hidden_size//2]
# w2 shape: [num_experts, hidden_size, intermediate_size//2]
assert w1.dtype == torch.uint8
assert w2.dtype == torch.uint8

# Dequantize w1 from packed NVFP4 to fp16/bf16
w13_global_scale = self.quant_config.g1_alphas

w1_dequant = dequantize_to_dtype(
tensor_fp4=w1,
tensor_sf=self.w1_scale_val,
global_scale=w13_global_scale,
dtype=hidden_states.dtype,
block_size=16,
swizzle=False,
)

# Dequantize w2 from packed NVFP4 to fp16/bf16
w2_global_scale = self.quant_config.g2_alphas

w2_dequant = dequantize_to_dtype(
tensor_fp4=w2,
tensor_sf=self.w2_scale_val,
global_scale=w2_global_scale,
dtype=hidden_states.dtype,
block_size=16,
swizzle=False,
)

hidden_states, _ = moe_kernel_quantize_input(
A=hidden_states,
A_scale=self.quant_config.a1_gscale,
quant_dtype="nvfp4",
per_act_token_quant=False,
quantization_emulation=True,
)

# The a2 activation quantization/dequantization is deferred to
# `moe_kernel_quantize_input` inside `TritonExperts.apply`, via `a2_scale`.
super().apply(
output=output,
hidden_states=hidden_states,
w1=w1_dequant,
w2=w2_dequant,
topk_weights=topk_weights,
topk_ids=topk_ids,
activation=activation,
global_num_experts=global_num_experts,
expert_map=expert_map,
a1q_scale=None,
a2_scale=self.quant_config.a2_gscale,
workspace13=workspace13,
workspace2=workspace2,
expert_tokens_meta=expert_tokens_meta,
apply_router_weight_on_input=apply_router_weight_on_input,
)
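The weight-side counterpart of the activation QDQ above is what `dequantize_to_dtype` does conceptually: unpack two FP4 codes per uint8 byte and apply the per-block and global scales. A minimal NumPy sketch, assuming low-nibble-first packing, unswizzled float block scales, and flat layout (all assumptions; the real helper works on torch tensors with FP8 E4M3 block scales):

```python
import numpy as np

# FP4 E2M1 lookup table indexed by the 4-bit code; the MSB is the sign bit.
FP4_LUT = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                    -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])


def dequantize_nvfp4(packed: np.ndarray, block_scales: np.ndarray,
                     global_scale: float, block_size: int = 16) -> np.ndarray:
    """Unpack uint8-packed FP4 weights and rescale them to float."""
    lo = packed & 0x0F  # first element of each pair (assumed low nibble)
    hi = packed >> 4    # second element of each pair
    codes = np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)
    values = FP4_LUT[codes]
    # Each consecutive run of `block_size` elements shares one block scale.
    blocks = values.reshape(*values.shape[:-1], -1, block_size)
    blocks = blocks * block_scales[..., None] * global_scale
    return blocks.reshape(values.shape)
```

In the MoE path above this dequantization runs on every forward pass, which is the overhead the warning in `Nvfp4QuantizationEmulationTritonExperts.__init__` refers to.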