
[Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding#30885

Merged
vllm-bot merged 17 commits into vllm-project:main from LopezCastroRoberto:feature/nvfp4_8x4_tiling
Jan 13, 2026
Conversation

@LopezCastroRoberto (Contributor) commented Dec 17, 2025:

Summary

This PR adds an opt-in NVFP4 backend variant that uses smaller scaling-factor tiling (8x4 SF layout). The change targets small-concurrency decode workloads and delivers ~25–35% higher output token throughput compared to the current best NVFP4 backend at small batch sizes.

Note: this backend is not recommended for medium or large batch sizes; the benefits are limited to the small-batch decode regime (see below).

The backend is automatically selected when:

export VLLM_NVFP4_GEMM_BACKEND=flashinfer-trtllm

and batch size <= 32.
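In other words, dispatch is gated on both the environment variable and the per-call row count. A minimal sketch of that condition (the function name here is illustrative, not the PR's actual code):

```python
import os

def use_8x4_sf_layout(num_rows: int, threshold: int = 32) -> bool:
    """Illustrative gate: the 8x4 SF variant is only picked under the
    trtllm FlashInfer backend and in the small-batch decode regime."""
    backend = os.environ.get("VLLM_NVFP4_GEMM_BACKEND", "")
    return backend == "flashinfer-trtllm" and num_rows <= threshold

os.environ["VLLM_NVFP4_GEMM_BACKEND"] = "flashinfer-trtllm"
print(use_8x4_sf_layout(8))   # True: small decode batch
print(use_8x4_sf_layout(64))  # False: falls back to the standard SF layout
```

Keeping the gate per-call means prefill (large row counts) still uses the standard layout even when the env var is set.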

Microbenchmark results (quantization overhead included for FP4 numbers)

Notes

  • All results except FBGEMM are FlashInfer implementations
  • Autotuning enabled for all runs

Example (10240x8192 weights, nvidia/Llama-3.3-70B-Instruct-NVFP4):

(bar plots omitted: quant_small_bs, quant)

Additional examples:

Model: meta-llama/Llama-3.1-8B-Instruct
Metric: TFLOP/s
Problem size: N = 28672, K = 4096

| Batch Size | torch-bf16 | cutlass-nvfp4 | trtllm-nvfp4 | cudnn-nvfp4 | trtllm_8x4_sf-nvfp4 (new) | fbgemm-nvfp4 |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 5.97 | 13.45 | 8.32 | 7.80 | 20.42 | 12.88 |
| 2 | 11.93 | 30.99 | 17.92 | 17.85 | 41.60 | 25.46 |
| 4 | 23.85 | 67.47 | 36.54 | 43.55 | 83.56 | 60.14 |
| 8 | 47.41 | 141.56 | 77.58 | 92.25 | 164.95 | 119.05 |
| 16 | 94.05 | 285.90 | 161.63 | 184.94 | 322.11 | 236.75 |
| 32 | 182.10 | 575.06 | 334.03 | 373.74 | 369.62 | 461.51 |
| 64 | 357.33 | 1146.03 | 715.13 | 698.12 | 448.99 | 1001.14 |
| 128 | 607.87 | 1946.30 | 1851.35 | 1192.37 | 508.45 | 2015.76 |
| 256 | 1031.79 | 2839.15 | 2308.84 | 2583.32 | 541.70 | 2858.62 |
| 512 | 1148.44 | 3490.14 | 2840.43 | 2961.75 | 558.83 | 2954.76 |
| 1024 | 1131.31 | 4557.48 | 3029.03 | 3564.80 | 568.11 | 3489.49 |
| 2048 | 1294.41 | 4911.02 | 3152.24 | 3635.49 | 570.32 | 3857.35 |
| 4096 | 1365.47 | 4819.24 | 3154.10 | 3748.01 | 574.34 | 3868.30 |
| 8192 | 1446.52 | 4958.41 | 3170.63 | 3844.16 | 577.77 | 3901.14 |
| 16384 | 1451.02 | 4798.02 | 3250.96 | 4069.28 | 578.47 | 3938.36 |

Model: meta-llama/Llama-3.3-70B-Instruct
Metric: TFLOP/s
Problem size: N = 8192, K = 8192

| Batch Size | torch-bf16 | cutlass-nvfp4 | trtllm-nvfp4 | cudnn-nvfp4 | trtllm_8x4_sf-nvfp4 (new) | fbgemm-nvfp4 |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 5.72 | 7.92 | 5.37 | 5.61 | 14.08 | 10.52 |
| 2 | 11.41 | 18.32 | 12.16 | 15.01 | 28.08 | 22.54 |
| 4 | 22.86 | 39.38 | 25.59 | 32.04 | 56.19 | 44.83 |
| 8 | 45.65 | 82.76 | 54.63 | 66.37 | 111.44 | 89.63 |
| 16 | 84.19 | 167.37 | 112.83 | 134.32 | 219.55 | 175.97 |
| 32 | 165.89 | 337.50 | 243.26 | 268.83 | 432.75 | 347.74 |
| 64 | 320.00 | 671.20 | 550.12 | 523.06 | 501.44 | 684.05 |
| 128 | 656.17 | 1233.54 | 1248.79 | 953.45 | 553.85 | 1313.27 |
| 256 | 1029.73 | 1955.48 | 2319.43 | 1882.18 | 645.24 | 2580.77 |
| 512 | 1315.83 | 3125.66 | 2609.10 | 3303.98 | 667.88 | 3365.54 |
| 1024 | 1345.53 | 3641.77 | 2636.36 | 3851.73 | 685.73 | 3578.31 |
| 2048 | 1335.88 | 4185.16 | 3166.94 | 3849.99 | 699.08 | 3181.51 |
| 4096 | 1345.35 | 4580.09 | 3213.39 | 4049.13 | 700.62 | 4062.80 |
| 8192 | 1441.36 | 4625.21 | 3292.62 | 4178.33 | 709.65 | 3890.69 |
| 16384 | 1452.46 | 4715.90 | 3381.17 | 4239.30 | 712.55 | 3930.54 |

trtllm_8x4_sf-nvfp4 is consistently the best at batch size ≤ 16, and sometimes at 32, which aligns with the target decode regime.
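For reference, the TFLOP/s numbers in these tables follow the usual 2·M·N·K FLOP count for a GEMM. A small helper (names are mine, not from the PR) shows what the batch-1 figure implies in wall-clock time:

```python
def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOP/s for an (m x k) @ (k x n) GEMM, counting 2*m*n*k FLOPs."""
    return 2.0 * m * n * k / seconds / 1e12

# At M=1, N=28672, K=4096, the reported 20.42 TFLOP/s corresponds to
# roughly 11.5 microseconds per GEMM (quantization overhead included):
flops = 2 * 1 * 28672 * 4096
print(f"{flops / 20.42e12 * 1e6:.1f} us")  # 11.5 us
```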

Preliminary Results

Setup

  • GPU: B200
  • TP size: 1
  • Concurrency: --max-concurrency=1

Small-batch Decode Throughput (FP4 + FP8)

Model: nvidia/Llama-3.3-70B-Instruct

| Precision | Backend | Output Token Throughput (tok/s) | Mean TPOT (ms) |
|---|---|---:|---:|
| NVFP4 | flashinfer-cutlass | 67.13 | 14.89 |
| NVFP4 | flashinfer-cudnn | 54.75 | 18.26 |
| NVFP4 | flashinfer-trtllm | 48.09 | 20.79 |
| NVFP4 | flashinfer-trtllm_8x4_sf (new) | 83.89 | 11.91 |
| FP8 | (reference) | 64.59 | 15.48 |

Model: nvidia/Llama-3.1-8B-Instruct

| Precision | Backend | Output Token Throughput (tok/s) | Mean TPOT (ms) |
|---|---|---:|---:|
| NVFP4 | flashinfer-cutlass | 240.32 | 4.15 |
| NVFP4 | flashinfer-trtllm_8x4_sf (new) | 327.69 | 3.04 |
| FP8 | (reference) | 283.36 | 3.52 |

Relative to the current best NVFP4 baseline (flashinfer-cutlass):

  • ~25–35% higher output token throughput at small batch / low concurrency

Relative to the current FP8 baseline:

  • ~15–30% higher output token throughput at small batch / low concurrency
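These headline percentages can be reproduced directly from the throughput tables above (a quick check; the dict layout is mine):

```python
# Output token throughput (tok/s) taken from the tables above.
results = {
    "Llama-3.3-70B": {"cutlass": 67.13, "8x4_sf": 83.89, "fp8": 64.59},
    "Llama-3.1-8B":  {"cutlass": 240.32, "8x4_sf": 327.69, "fp8": 283.36},
}
for model, r in results.items():
    print(f"{model}: "
          f"+{r['8x4_sf'] / r['cutlass'] - 1:.0%} vs flashinfer-cutlass, "
          f"+{r['8x4_sf'] / r['fp8'] - 1:.0%} vs FP8")
# Llama-3.3-70B: +25% vs flashinfer-cutlass, +30% vs FP8
# Llama-3.1-8B: +36% vs flashinfer-cutlass, +16% vs FP8
```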

Full logs for flashinfer-trtllm_8x4_sf below

nvidia/Llama-3.3-70B-Instruct-NVFP4:

============ Serving Benchmark Result ============
Successful requests:                     2         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  24.41     
Total input tokens:                      2         
Total generated tokens:                  2048      
Request throughput (req/s):              0.08      
Output token throughput (tok/s):         83.89     
Peak output token throughput (tok/s):    87.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          83.97     
---------------Time to First Token----------------
Mean TTFT (ms):                          22.18     
Median TTFT (ms):                        22.18     
P99 TTFT (ms):                           22.97     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.91     
Median TPOT (ms):                        11.91     
P99 TPOT (ms):                           12.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.91     
Median ITL (ms):                         11.54     
P99 ITL (ms):                            19.34     
==================================================
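As a sanity check on the 70B log, dividing total generated tokens by the benchmark duration reproduces the reported output token throughput:

```python
# Values taken from the serving benchmark log above.
tokens, duration_s = 2048, 24.41
print(f"{tokens / duration_s:.1f} tok/s")  # 83.9, matching the reported 83.89
```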

nvidia/Llama-3.1-8B-Instruct-NVFP4

============ Serving Benchmark Result ============
Successful requests:                     2         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  6.25      
Total input tokens:                      2         
Total generated tokens:                  2048      
Request throughput (req/s):              0.32      
Output token throughput (tok/s):         327.69    
Peak output token throughput (tok/s):    329.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          328.01    
---------------Time to First Token----------------
Mean TTFT (ms):                          10.21     
Median TTFT (ms):                        10.21     
P99 TTFT (ms):                           10.75     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.04      
Median TPOT (ms):                        3.04      
P99 TPOT (ms):                           3.05      
---------------Inter-token Latency----------------
Mean ITL (ms):                           3.04      
Median ITL (ms):                         3.04      
P99 ITL (ms):                            3.39      
==================================================

Note

Optimizes NVFP4 small-batch decode via a smaller scaling-factor tiling and integrates it end-to-end.

  • FlashInfer: extend mm_fp4 with use_8x4_sf_layout; add custom op vllm::flashinfer_nvfp4_quantize and helper flashinfer_quant_nvfp4_8x4_sf_layout; flashinfer_scaled_fp4_mm auto-enables 8x4 SF for trtllm when rows ≤ 32
  • Linear paths: in compressed_tensors_w4a4_nvfp4.py (and modelopt) auto-switch to 8x4 SF quantization for flashinfer-trtllm small inputs; otherwise use scaled_fp4_quant
  • Test utilities: add convert_swizzled_8x4_layout_to_linear and a layout flag to dequantize_nvfp4_to_dtype
  • Tests/CI: expand NVFP4 GEMM tests (small-batch shapes, TRTLLM backend), add e2e NVFP4 model test, and run in Blackwell CI; minor CI label/time tweak and env var choices formatting
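One way to see why the smaller tile helps at tiny M (my own back-of-the-envelope sketch, assuming the 8x4 layout replaces the usual 128x4 swizzled SF layout): with block size 16, each activation row carries K/16 FP8 scale factors, and the swizzled SF tensor is padded up to tile boundaries, so at M=1 the 8x4 layout materializes roughly 16x fewer SF elements:

```python
def padded_sf_elems(m: int, k: int, row_tile: int,
                    col_tile: int = 4, block_size: int = 16) -> int:
    """FP8 scale-factor count after padding M and K/block_size up to the
    swizzle tile boundaries (assumed layout, for illustration only)."""
    rows = -(-m // row_tile) * row_tile                  # ceil to row_tile
    cols = -(-(k // block_size) // col_tile) * col_tile  # ceil to col_tile
    return rows * cols

k = 4096
print(padded_sf_elems(1, k, row_tile=128))  # 32768 with a 128x4 tile
print(padded_sf_elems(1, k, row_tile=8))    # 2048 with an 8x4 tile: 16x fewer
```

By M=32 the two layouts pad to similar sizes, which is consistent with the crossover seen in the microbenchmarks.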

Written by Cursor Bugbot for commit 01a7e5d.


Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
@chatgpt-codex-connector: Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a new NVFP4 backend variant with a smaller 8x4 scaling-factor tiling layout, aimed at improving performance for small-batch decoding workloads. The changes are well-structured, touching upon the core quantization logic, environment variable definitions, and associated tests. I've identified a critical issue in compressed_tensors_w4a4_nvfp4.py where an undefined attribute is being used, which would lead to a runtime error. Additionally, there's a minor issue in flashinfer.py concerning an incorrect export. Addressing these points will solidify the implementation.

Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
@mgoin mgoin self-assigned this Dec 17, 2025
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review December 17, 2025 22:36
@chatgpt-codex-connector bot left a comment:

💡 Codex Review

P0: Review blocked by sandbox failure

I could not inspect commit 03c8db0ecb70fdf7fade1f70c9b17ace1a4b935d because every attempt to run shell commands in the workspace fails immediately with a linux-sandbox LandlockRestrict panic, leaving the repository inaccessible. Please rerun the review in an environment where exec access works so the diff can be analyzed.


vllm/envs.py Outdated
"flashinfer-cudnn",
"flashinfer-trtllm",
"flashinfer-cutlass",
"flashinfer-trtllm_8x4_sf_layout",
A Collaborator commented:
I wonder if we should enable this by default and let the autotuner pick the suitable tile size. I'm concerned it may cause unintended confusion to the users.

g_scale,
dtype,
block_size=16,
use_8x4_sf_layout=use_8x4_sf_layout,
@pavanimajety (Collaborator) commented Dec 17, 2025:
Perhaps make this an automated setting based on when 8x4_sf would be a better choice like A.shape[0] < 32 ?

@LopezCastroRoberto (Contributor, Author) replied Dec 19, 2025:
Yes, based on my benchmarks this is the right choice. I would also make this backend the default automatically in those cases.

LopezCastroRoberto and others added 2 commits December 19, 2025 10:27
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
mergify bot commented Jan 2, 2026:

Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

LopezCastroRoberto and others added 2 commits January 2, 2026 11:40
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
@mgoin (Member) left a comment:
LGTM as long as it works with torch.compile. Nice analysis!

Comment on lines +1368 to +1372
if self.backend == "flashinfer-trtllm" and x.shape[0] <= 32:
x_fp4, x_blockscale = flashinfer_quant_nvfp4_8x4_sf_layout(
x, layer.input_scale_inv
)
x_blockscale = x_blockscale.view(torch.float8_e4m3fn)
A Member commented:
Maybe we should put this logic inside of scaled_fp4_quant and pass in backend to that function

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 7, 2026
@mgoin mgoin added the performance Performance-related issues label Jan 7, 2026
LopezCastroRoberto and others added 2 commits January 8, 2026 13:16
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
@mergify mergify bot added the ci/build label Jan 9, 2026
mergify bot commented Jan 9, 2026 with the same pre-commit failure notice as above.

LopezCastroRoberto and others added 2 commits January 9, 2026 07:06
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
LopezCastroRoberto and others added 2 commits January 12, 2026 17:52
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@mgoin (Member) left a comment:
LGTM, nice work!

@vllm-bot vllm-bot merged commit 8ef50d9 into vllm-project:main Jan 13, 2026
61 of 64 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 13, 2026
This pull request was subsequently referenced by commits in sammysun0711/vllm (Jan 16, 2026), akh64bit/vllm (Jan 16, 2026), dsuhinin/vllm (Jan 21, 2026), and ItzDEXX/vllm (Feb 19, 2026).
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
Add comprehensive performance analysis for MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:

Architecture confirmed:
- Attention IS NVFP4 in this model (ignore list = only lm_head + MoE gates)
- 3 MTP modules present (layers 62-64) — biggest performance lever available
- Per-step weight load: ~6.15 GB → 36–44 tok/s theoretical ceiling on GB10

Performance gap analysis:
- Current: 24 tok/s on Strix Halo (AMD); GB10 expected similar baseline
- vLLM is 1.78x slower than SGLang at BS=1 for NVFP4 MoE (documented gap)
- Gap sources: activation quant overhead, kernel launch overhead, no fused
  shuffle+reduce in MoE, generic CUTLASS configs

Key new PRs to integrate:
- vllm-project#35041 (OPEN): MTP+NVFP4 weight shape mismatch — required for MTP+NVFP4
- vllm-project#35442 (OPEN): Non-blocking MTP token copy — 6ms→200µs CPU-GPU sync
- vllm-project#33303 (OPEN): MiniMax PP+DP for multi-Spark scaling

Already-merged PRs confirmed in HEAD:
- vllm-project#34718 (act_quant_fusion.py): SiLU+FP4 fusion
- vllm-project#34899 (allreduce_rms_fusion.py): NVFP4 AR+Norm fusion
- vllm-project#30885: 8x4 SF tiling (not yet effective on GB10 — TRTLLM backend blocked)
scottgl9 added a second commit to scottgl9/vllm referencing this pull request on Mar 4, 2026, with the same analysis.

Labels

ci/build, nvidia, performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done


4 participants