[MoE/b12x] Accept W4A16 (kNvfp4Static, None) in FlashInferB12xExperts supports check by ECMGit · Pull Request #43332 · vllm-project/vllm

ECMGit · 2026-05-21T16:48:08Z

Purpose

FlashInferB12xExperts._supports_quant_scheme (introduced by PR #40082) currently requires the activation key to be kNvfp4Dynamic, which makes the dispatcher reject every W4A16 NVFP4 checkpoint (activation_key == None) — e.g. nvidia/Qwen3.6-35B-A3B-2.06GB-per-token. This forces such checkpoints onto Marlin, even though the b12x kernel itself is W4A16-compatible.

PR #42566 ("W4A16 NVFP4 fused MoE + mixed-precision dispatch") acknowledges this exact gap in its own commit message — "their _supports_quant_scheme requires (kNvfp4Static, kNvfp4Dynamic) exactly... only Marlin survives" — and deliberately routes W4A16 to Marlin as a workaround. This PR is the actual fix on the b12x side, complementary to #42566 and independent of its merge.

Per the b12x class docstring: "Input quantization (BF16→FP4) is performed inside the kernel so BF16 hidden states are passed directly." — i.e. the kernel already handles the BF16-activation case correctly. This PR only loosens the metadata gate; no kernel-side changes needed.

Change:

-        return (weight_key, activation_key) == (kNvfp4Static, kNvfp4Dynamic)
+        # b12x performs in-kernel BF16->FP4 activation quant, so W4A16
+        # NVFP4 checkpoints (activation_key=None, e.g. mixed-precision
+        # compressed-tensors layouts) are runtime-compatible.
+        return (weight_key, activation_key) in (
+            (kNvfp4Static, kNvfp4Dynamic),
+            (kNvfp4Static, None),
+        )

Failure mode without this PR:

ValueError: NvFp4 MoE backend 'FLASHINFER_B12X' does not support the deployment
configuration since kernel does not support quantization scheme
QuantKey(u8, scale(f8e4m3fn, static, GroupShape(row=1, col=16)),
         scale2(f32, static, per_tensor), symmetric) x None.

Risk: None — gate-only loosening. No kernel changes. No new dependencies. The existing (kNvfp4Static, kNvfp4Dynamic) path is preserved verbatim.

Test Plan

Hardware: DGX Spark (GB10, sm_121a)
Container: vllm/vllm-openai:nightly-aarch64 + this PR + the FP8-backend-env companion PR
Model: nvidia/Qwen3.6-35B-A3B-2.06GB-per-token (modelopt-native, mixed NVFP4 + FP8 experts)
Tooling: FlashInfer 0.6.11.post3, cutlass-dsl trio 4.5.1

Serve command:

export VLLM_NVFP4_GEMM_BACKEND=flashinfer-b12x
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_FP8_MOE_BACKEND=flashinfer_cutlass     # companion PR
export FLASHINFER_DISABLE_VERSION_CHECK=1
export CUTE_DSL_ARCH=sm_121a

vllm serve nvidia/Qwen3.6-35B-A3B-2.06GB-per-token \
    --tensor-parallel-size 1 --trust-remote-code --dtype auto \
    --kv-cache-dtype fp8 --attention-backend FLASHINFER \
    --gpu-memory-utilization 0.85 --max-model-len 40960 \
    --max-num-seqs 4 --max-num-batched-tokens 8192 \
    --enable-chunked-prefill --async-scheduling --enable-prefix-caching \
    --moe-backend=flashinfer_b12x --quantization=modelopt \
    --compilation-config '{"pass_config":{"fuse_norm_quant":true,"fuse_act_quant":true,"fuse_attn_quant":false}}' \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3,"rejection_sample_method":"synthetic","synthetic_acceptance_length":3.12,"moe_backend":"triton"}'

Benchmark (aiperf): K=3 AL=3.12, BS=1 / concurrency=1, ISL=2048 + 32K user-context (total ISL=34,831), OSL=1024, 10 warmup + 60 measured requests, --use-server-token-count --streaming.

Test Result

With this PR (b12x dispatcher accepts W4A16):

Metric	Value
Output Token Throughput	91.00 tok/s
Output Token Throughput / user	97.42 tok/s/user
TTFT	746.81 ms
ITL	10.27 ms
Request Latency	11,249.37 ms
MTP acceptance length	3.15 (target 3.12) ✓
Requests / errors	60 / 0

Without this PR: serve dies at engine init with the ValueError shown above before any throughput can be measured.

Reference on the same hardware/workload (sanity comparison):

Path	OTT (tok/s)	TTFT (ms)
Marlin (current W4A16 fallback in #42566)	92.26	798.72
b12x on dgx-fork (matched-cubin FlashInfer)	95.15	758.90
b12x on upstream (this PR)	91.00	746.81

With this PR, b12x runs cleanly on W4A16 NVFP4 checkpoints and reaches parity with Marlin on the public wheel stack (the small ~4% delta vs the dgx-fork b12x number is attributable to a known flashinfer-cubin 0.6.11.post2 / flashinfer 0.6.11.post3 version-skew in the public PyPI wheel — unrelated to this PR).

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

github-actions · 2026-05-21T16:48:21Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

gemini-code-assist

Code Review

This pull request updates the flashinfer_b12x_moe expert to support W4A16 NVFP4 checkpoints by allowing None as an activation quantization key in the _supports_quant_scheme method. While this enables the dispatcher to select the backend, feedback points out that the apply method contains a strict assertion requiring an activation scale (a2_gscale), which will likely cause an AssertionError for these models. It is recommended to update the apply method to provide a default scale tensor instead of asserting.

ECMGit · 2026-05-21T17:48:53Z

add @meena-at-work, @juhi10071998 as reviewer

ECMGit · 2026-05-22T04:49:27Z

Crossing reference with #43341 (@askliar) which also touches FlashInferB12xExperts for W4A16 NVFP4 — I am confirming the two PRs are complementary, not duplicate:

Quant scheme	Path	This PR (#43332)	#43341
`(kNvfp4Static, kNvfp4Dynamic)`	modelopt W4A4	✓ (preserved)	✓ (preserved)
`(kNvfp4Static, None)`	modelopt-native W4A16	✓ (new, this PR)	—
`(uint8 + kNvfp4StaticGroupScale + kStaticTensorScale, None)`	compressed-tensors `nvfp4-pack-quantized` W4A16	—	✓ (new, #43341)

So this PR covers the modelopt-native W4A16 family (e.g. nvidia/Qwen3.6-35B-A3B-2.06GB-per-token), while #43341 covers the compressed-tensors nvfp4-pack-quantized family (e.g. nvidia/Nemotron-H-3.5-*). #43341's own description acknowledges this split: "Both could land in any order; together they cover both W4A16 sources."

The only mechanical conflict is in the body of _supports_quant_scheme: whichever PR lands second needs a ~3-line rebase to combine both accepted forms.

One important note for landing order

The apply() a2_gscale=1.0 default in this PR's second commit (ac2187a72) is NOT in #43341. That commit replaces the strict assert self.a2_gscale is not None with a uniform-1.0 fallback per expert (symmetric to the existing if self.a2_gscale is not None: fill_(1.0) guard one method above in process_weights_after_loading).

This matters because compressed-tensors nvfp4-pack-quantized W4A16 (the case #43341 enables) has activation_key == None, so the FusedMoE machinery does not populate a2_gscale — and #43341 still uses self.a2_gscale directly in its _wrapper.run(...) call. Without the default-fallback fix, that path would hit AssertionError: a2_gscale must not be None at the first inference request.

So even if #43341 lands first, this PR's Change 2 is still needed end-to-end for the compressed-tensors W4A16 path to actually serve requests.

cc @askliar — happy to coordinate landing order, or if you'd prefer I can rebase Change 2 onto your apply() refactor so reviewers see a single coherent W4A16 picture.

… supports check `FlashInferB12xExperts._supports_quant_scheme` currently requires the activation key to be `kNvfp4Dynamic`, which makes the dispatcher reject every W4A16 NVFP4 checkpoint (activation_key == None) -- e.g. `nvidia/Qwen3.6-35B-A3B-2.06GB-per-token`. This forces such checkpoints onto Marlin even though the b12x kernel itself is W4A16-compatible. Per the class docstring: "Input quantization (BF16->FP4) is performed inside the kernel so BF16 hidden states are passed directly." -- i.e. the kernel already handles the BF16-activation case correctly. This change only loosens the metadata gate; no kernel-side changes. PR vllm-project#42566 ("W4A16 NVFP4 fused MoE + mixed-precision dispatch") only touches quantization/modelopt.py and acknowledges the gap in its own commit message: "their _supports_quant_scheme requires (kNvfp4Static, kNvfp4Dynamic) exactly... only Marlin survives." That PR deliberately routes W4A16 to Marlin as a workaround; this PR is the fix on the b12x side. The two are complementary and can land independently -- once both land, W4A16 NVFP4 prefers b12x (fast path). Failure mode without this PR: ValueError: NvFp4 MoE backend 'FLASHINFER_B12X' does not support the deployment configuration since kernel does not support quantization scheme QuantKey(u8, scale(f8e4m3fn, static, GroupShape(row=1, col=16)), scale2(f32, static, per_tensor), symmetric) x None. Tested on DGX Spark (GB10, sm_121a) with vllm/vllm-openai:nightly-aarch64 + this PR + the FP8-backend-env companion PR. Model: nvidia/Qwen3.6-35B-A3B-2.06GB-per-token (modelopt-native, mixed NVFP4 + FP8 experts). aiperf K=3 AL=3.12, BS=1, ISL=2048+32K prefix=34,831, OSL=1024, 60 measured + 10 warmup, 0 errors: Output Token Throughput : 91.00 tok/s Output Token Throughput / user : 97.42 tok/s/user TTFT : 746.81 ms ITL : 10.27 ms Request Latency : 11,249.37 ms MTP acceptance length : 3.15 (target 3.12) For reference on the same workload: Marlin (current W4A16 fallback) : OTT 92.26, TTFT 798.72 b12x on dgx-fork (matched cubin): OTT 95.15, TTFT 758.90 Without this change b12x rejects the checkpoint at engine init; with it b12x runs and matches/exceeds Marlin on the b12x-fast path. Signed-off-by: Junhao Shen <junshen@nvidia.com>

…hout activation quant metadata Addresses review feedback on the preceding commit (supports-check loosening for W4A16). `FlashInferB12xExperts.apply` previously asserted `self.a2_gscale is not None` unconditionally. For W4A16 NVFP4 checkpoints lacking static activation-quant metadata (e.g. compressed-tensors W4A16-CT layouts), `a2_gscale` is legitimately None and the assert fires at the first forward pass -- strictly worse than the engine-init rejection we just removed at the dispatcher gate. `process_weights_after_loading` already tolerates `a2_gscale is None` (the `if self.a2_gscale is not None: ...` guard at the top of this file), so the assert is the inconsistency. The b12x kernel performs dynamic per-block FC2-input quantization internally, so a uniform 1.0 scale per expert is semantically equivalent to the bake-in done for static-quant checkpoints. Construct the default in apply() instead of asserting. Signed-off-by: Junhao Shen <junshen@nvidia.com>

vadiklyutiy · 2026-05-28T20:26:54Z

+        # quant checkpoints.
+        fc2_input_scale = self.a2_gscale
+        if fc2_input_scale is None:
+            fc2_input_scale = torch.ones(


This is apply and hot path. I think it is better do any allocation somewhere early.

okay updated, route to early stage

vadiklyutiy · 2026-05-29T14:21:17Z

@claude review pls

mergify · 2026-05-29T14:50:12Z

Hi @ECMGit, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

mergify · 2026-05-29T14:57:51Z

Hi @ECMGit, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

adding self._fc2_input_scale since the self.a2_gscale is a read-only attribute Signed-off-by: Junhao Shen <junshen@nvidia.com>

ECMGit · 2026-05-31T13:55:24Z

Hi @vadiklyutiy , the pre-commit has passed. let me know if further change is needed.

xinli-sw

LGTM!

mgoin

LGTM, thanks!

… supports check (vllm-project#43332) Signed-off-by: Junhao Shen <junshen@nvidia.com> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

… supports check (vllm-project#43332) Signed-off-by: Junhao Shen <junshen@nvidia.com> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>

… supports check (vllm-project#43332) Signed-off-by: Junhao Shen <junshen@nvidia.com> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Signed-off-by: JisoLya <523420504@qq.com>

mergify Bot added the nvidia label May 21, 2026

github-project-automation Bot added this to NVIDIA May 21, 2026

gemini-code-assist Bot reviewed May 21, 2026

View reviewed changes

Comment thread vllm/model_executor/layers/fused_moe/experts/flashinfer_b12x_moe.py

askliar mentioned this pull request May 21, 2026

[B12x][NVFP4] Add W4A16 support to FlashInfer SM12x MoE wrapper #43341

Open

2 tasks

ECMGit marked this pull request as ready for review May 21, 2026 17:48

ECMGit requested review from mgoin, pavanimajety and zyongye as code owners May 21, 2026 17:48

askliar mentioned this pull request May 26, 2026

[MoE/b12x] Add cutlass_prefill_threshold hybrid dispatch to FlashInferB12xExperts #43687

Open

5 tasks

ECMGit added 2 commits May 27, 2026 01:34

ECMGit force-pushed the b12x-w4a16-supports branch from 8b7c0a0 to bcc299b Compare May 26, 2026 17:35

Kh4L approved these changes May 26, 2026

View reviewed changes

ECMGit mentioned this pull request May 28, 2026

[MoE] Add VLLM_FP8_MOE_BACKEND to override FP8-expert routing indepently of --moe-backend #43334

Open

4 tasks

vadiklyutiy added the verified Run pre-commit for new contributors without triggering other tests label May 28, 2026

Merge branch 'main' into b12x-w4a16-supports

704ab6e

vadiklyutiy reviewed May 28, 2026

View reviewed changes

meena-at-work mentioned this pull request May 28, 2026

b12x nvfp4 w4a16 use a16 fix #43929

Draft

ECMGit force-pushed the b12x-w4a16-supports branch from 384f8f1 to 5d23ea5 Compare May 29, 2026 14:51

move fc2_input_scale = torch.ones(...) out of hot path in a early step

0114e1d

adding self._fc2_input_scale since the self.a2_gscale is a read-only attribute Signed-off-by: Junhao Shen <junshen@nvidia.com>

ECMGit force-pushed the b12x-w4a16-supports branch from 5d23ea5 to 0114e1d Compare May 31, 2026 12:16

xinli-sw reviewed Jun 2, 2026

View reviewed changes

xinli-sw approved these changes Jun 2, 2026

View reviewed changes

mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 2, 2026

mgoin approved these changes Jun 2, 2026

View reviewed changes

github-project-automation Bot moved this to Ready in NVIDIA Jun 2, 2026

vadiklyutiy enabled auto-merge (squash) June 2, 2026 21:42

vadiklyutiy approved these changes Jun 2, 2026

View reviewed changes

vllm-bot merged commit a4ac746 into vllm-project:main Jun 2, 2026
68 of 70 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 2, 2026

Uh oh!

Conversation

ECMGit commented May 21, 2026

Purpose

Test Plan

Test Result

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

ECMGit commented May 21, 2026

Uh oh!

ECMGit commented May 22, 2026

One important note for landing order

Uh oh!

vadiklyutiy May 28, 2026

Choose a reason for hiding this comment

Uh oh!

ECMGit May 29, 2026

Choose a reason for hiding this comment

Uh oh!

vadiklyutiy commented May 29, 2026

Uh oh!

mergify Bot commented May 29, 2026

Uh oh!

mergify Bot commented May 29, 2026

Uh oh!

ECMGit commented May 31, 2026

Uh oh!

xinli-sw left a comment

Choose a reason for hiding this comment

Uh oh!

mgoin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants