
[Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding#30885

Merged
vllm-bot merged 17 commits into vllm-project:main from LopezCastroRoberto:feature/nvfp4_8x4_tiling
Jan 13, 2026
Conversation

@LopezCastroRoberto (Contributor) commented Dec 17, 2025:

Summary

This PR adds an opt-in NVFP4 backend variant that uses smaller scaling-factor tiling (8x4 SF layout). The change targets small-concurrency decode workloads and delivers ~25–35% higher output token throughput compared to the current best NVFP4 backend at small batch sizes.

Note: this backend is not recommended for medium or large batch sizes; the benefits are limited to the small-batch decode regime (see below).

The backend is automatically selected when:

export VLLM_NVFP4_GEMM_BACKEND=flashinfer-trtllm

and batch size <= 32.
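In other words, dispatch is gated on both the environment variable and the per-call row count. A minimal sketch of that condition (the function name here is illustrative, not the PR's actual code):

```python
import os

def use_8x4_sf_layout(num_rows: int, threshold: int = 32) -> bool:
    """Illustrative gate: the 8x4 SF variant is only picked under the
    trtllm FlashInfer backend and in the small-batch decode regime."""
    backend = os.environ.get("VLLM_NVFP4_GEMM_BACKEND", "")
    return backend == "flashinfer-trtllm" and num_rows <= threshold

os.environ["VLLM_NVFP4_GEMM_BACKEND"] = "flashinfer-trtllm"
print(use_8x4_sf_layout(8))   # True: small decode batch
print(use_8x4_sf_layout(64))  # False: falls back to the standard SF layout
```

Keeping the gate per-call means prefill (large row counts) still uses the standard layout even when the env var is set.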

Microbenchmark results (quantization overhead included for FP4 numbers)

Notes

  • All results except FBGEMM are FlashInfer implementations
  • Autotuning enabled for all runs

Example (10240x8192 weights, nvidia/Llama-3.3-70B-Instruct-NVFP4):

(bar plots omitted: quant_small_bs, quant)

Additional examples:

Model: meta-llama/Llama-3.1-8B-Instruct
Metric: TFLOP/s
Problem size: N = 28672, K = 4096

| Batch Size | torch-bf16 | cutlass-nvfp4 | trtllm-nvfp4 | cudnn-nvfp4 | trtllm_8x4_sf-nvfp4 (new) | fbgemm-nvfp4 |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 5.97 | 13.45 | 8.32 | 7.80 | 20.42 | 12.88 |
| 2 | 11.93 | 30.99 | 17.92 | 17.85 | 41.60 | 25.46 |
| 4 | 23.85 | 67.47 | 36.54 | 43.55 | 83.56 | 60.14 |
| 8 | 47.41 | 141.56 | 77.58 | 92.25 | 164.95 | 119.05 |
| 16 | 94.05 | 285.90 | 161.63 | 184.94 | 322.11 | 236.75 |
| 32 | 182.10 | 575.06 | 334.03 | 373.74 | 369.62 | 461.51 |
| 64 | 357.33 | 1146.03 | 715.13 | 698.12 | 448.99 | 1001.14 |
| 128 | 607.87 | 1946.30 | 1851.35 | 1192.37 | 508.45 | 2015.76 |
| 256 | 1031.79 | 2839.15 | 2308.84 | 2583.32 | 541.70 | 2858.62 |
| 512 | 1148.44 | 3490.14 | 2840.43 | 2961.75 | 558.83 | 2954.76 |
| 1024 | 1131.31 | 4557.48 | 3029.03 | 3564.80 | 568.11 | 3489.49 |
| 2048 | 1294.41 | 4911.02 | 3152.24 | 3635.49 | 570.32 | 3857.35 |
| 4096 | 1365.47 | 4819.24 | 3154.10 | 3748.01 | 574.34 | 3868.30 |
| 8192 | 1446.52 | 4958.41 | 3170.63 | 3844.16 | 577.77 | 3901.14 |
| 16384 | 1451.02 | 4798.02 | 3250.96 | 4069.28 | 578.47 | 3938.36 |

Model: meta-llama/Llama-3.3-70B-Instruct
Metric: TFLOP/s
Problem size: N = 8192, K = 8192

| Batch Size | torch-bf16 | cutlass-nvfp4 | trtllm-nvfp4 | cudnn-nvfp4 | trtllm_8x4_sf-nvfp4 (new) | fbgemm-nvfp4 |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 5.72 | 7.92 | 5.37 | 5.61 | 14.08 | 10.52 |
| 2 | 11.41 | 18.32 | 12.16 | 15.01 | 28.08 | 22.54 |
| 4 | 22.86 | 39.38 | 25.59 | 32.04 | 56.19 | 44.83 |
| 8 | 45.65 | 82.76 | 54.63 | 66.37 | 111.44 | 89.63 |
| 16 | 84.19 | 167.37 | 112.83 | 134.32 | 219.55 | 175.97 |
| 32 | 165.89 | 337.50 | 243.26 | 268.83 | 432.75 | 347.74 |
| 64 | 320.00 | 671.20 | 550.12 | 523.06 | 501.44 | 684.05 |
| 128 | 656.17 | 1233.54 | 1248.79 | 953.45 | 553.85 | 1313.27 |
| 256 | 1029.73 | 1955.48 | 2319.43 | 1882.18 | 645.24 | 2580.77 |
| 512 | 1315.83 | 3125.66 | 2609.10 | 3303.98 | 667.88 | 3365.54 |
| 1024 | 1345.53 | 3641.77 | 2636.36 | 3851.73 | 685.73 | 3578.31 |
| 2048 | 1335.88 | 4185.16 | 3166.94 | 3849.99 | 699.08 | 3181.51 |
| 4096 | 1345.35 | 4580.09 | 3213.39 | 4049.13 | 700.62 | 4062.80 |
| 8192 | 1441.36 | 4625.21 | 3292.62 | 4178.33 | 709.65 | 3890.69 |
| 16384 | 1452.46 | 4715.90 | 3381.17 | 4239.30 | 712.55 | 3930.54 |

trtllm_8x4_sf-nvfp4 is consistently the best at batch size ≤ 16, and sometimes at 32, which aligns with the target decode regime.
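For reference, the TFLOP/s numbers in these tables follow the usual 2·M·N·K FLOP count for a GEMM. A small helper (names are mine, not from the PR) shows what the batch-1 figure implies in wall-clock time:

```python
def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOP/s for an (m x k) @ (k x n) GEMM, counting 2*m*n*k FLOPs."""
    return 2.0 * m * n * k / seconds / 1e12

# At M=1, N=28672, K=4096, the reported 20.42 TFLOP/s corresponds to
# roughly 11.5 microseconds per GEMM (quantization overhead included):
flops = 2 * 1 * 28672 * 4096
print(f"{flops / 20.42e12 * 1e6:.1f} us")  # 11.5 us
```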

Preliminary Results

Setup

  • GPU: B200
  • TP size: 1
  • Concurrency: --max-concurrency=1

Small-batch Decode Throughput (FP4 + FP8)

Model: nvidia/Llama-3.3-70B-Instruct

| Precision | Backend | Output Token Throughput (tok/s) | Mean TPOT (ms) |
|---|---|---:|---:|
| NVFP4 | flashinfer-cutlass | 67.13 | 14.89 |
| NVFP4 | flashinfer-cudnn | 54.75 | 18.26 |
| NVFP4 | flashinfer-trtllm | 48.09 | 20.79 |
| NVFP4 | flashinfer-trtllm_8x4_sf (new) | 83.89 | 11.91 |
| FP8 | (reference) | 64.59 | 15.48 |

Model: nvidia/Llama-3.1-8B-Instruct

| Precision | Backend | Output Token Throughput (tok/s) | Mean TPOT (ms) |
|---|---|---:|---:|
| NVFP4 | flashinfer-cutlass | 240.32 | 4.15 |
| NVFP4 | flashinfer-trtllm_8x4_sf (new) | 327.69 | 3.04 |
| FP8 | (reference) | 283.36 | 3.52 |

Relative to the current best NVFP4 baseline (flashinfer-cutlass):

  • ~25–35% higher output token throughput at small batch / low concurrency

Relative to the current FP8 baseline:

  • ~15–30% higher output token throughput at small batch / low concurrency
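These headline percentages can be reproduced directly from the throughput tables above (a quick check; the dict layout is mine):

```python
# Output token throughput (tok/s) taken from the tables above.
results = {
    "Llama-3.3-70B": {"cutlass": 67.13, "8x4_sf": 83.89, "fp8": 64.59},
    "Llama-3.1-8B":  {"cutlass": 240.32, "8x4_sf": 327.69, "fp8": 283.36},
}
for model, r in results.items():
    print(f"{model}: "
          f"+{r['8x4_sf'] / r['cutlass'] - 1:.0%} vs flashinfer-cutlass, "
          f"+{r['8x4_sf'] / r['fp8'] - 1:.0%} vs FP8")
# Llama-3.3-70B: +25% vs flashinfer-cutlass, +30% vs FP8
# Llama-3.1-8B: +36% vs flashinfer-cutlass, +16% vs FP8
```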

Full logs for flashinfer-trtllm_8x4_sf below

nvidia/Llama-3.3-70B-Instruct-NVFP4:

============ Serving Benchmark Result ============
Successful requests:                     2         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  24.41     
Total input tokens:                      2         
Total generated tokens:                  2048      
Request throughput (req/s):              0.08      
Output token throughput (tok/s):         83.89     
Peak output token throughput (tok/s):    87.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          83.97     
---------------Time to First Token----------------
Mean TTFT (ms):                          22.18     
Median TTFT (ms):                        22.18     
P99 TTFT (ms):                           22.97     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.91     
Median TPOT (ms):                        11.91     
P99 TPOT (ms):                           12.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.91     
Median ITL (ms):                         11.54     
P99 ITL (ms):                            19.34     
==================================================
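As a sanity check on the 70B log, dividing total generated tokens by the benchmark duration reproduces the reported output token throughput:

```python
# Values taken from the serving benchmark log above.
tokens, duration_s = 2048, 24.41
print(f"{tokens / duration_s:.1f} tok/s")  # 83.9, matching the reported 83.89
```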

nvidia/Llama-3.1-8B-Instruct-NVFP4

============ Serving Benchmark Result ============
Successful requests:                     2         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  6.25      
Total input tokens:                      2         
Total generated tokens:                  2048      
Request throughput (req/s):              0.32      
Output token throughput (tok/s):         327.69    
Peak output token throughput (tok/s):    329.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          328.01    
---------------Time to First Token----------------
Mean TTFT (ms):                          10.21     
Median TTFT (ms):                        10.21     
P99 TTFT (ms):                           10.75     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.04      
Median TPOT (ms):                        3.04      
P99 TPOT (ms):                           3.05      
---------------Inter-token Latency----------------
Mean ITL (ms):                           3.04      
Median ITL (ms):                         3.04      
P99 ITL (ms):                            3.39      
==================================================

Note

Optimizes NVFP4 small-batch decode via a smaller scaling-factor tiling and integrates it end-to-end.

  • FlashInfer: extend mm_fp4 with use_8x4_sf_layout; add custom op vllm::flashinfer_nvfp4_quantize and helper flashinfer_quant_nvfp4_8x4_sf_layout; flashinfer_scaled_fp4_mm auto-enables 8x4 SF for trtllm when rows ≤ 32
  • Linear paths: in compressed_tensors_w4a4_nvfp4.py (and modelopt) auto-switch to 8x4 SF quantization for flashinfer-trtllm small inputs; otherwise use scaled_fp4_quant
  • Test utilities: add convert_swizzled_8x4_layout_to_linear and a layout flag to dequantize_nvfp4_to_dtype
  • Tests/CI: expand NVFP4 GEMM tests (small-batch shapes, TRTLLM backend), add e2e NVFP4 model test, and run in Blackwell CI; minor CI label/time tweak and env var choices formatting
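One way to see why the smaller tile helps at tiny M (my own back-of-the-envelope sketch, assuming the 8x4 layout replaces the usual 128x4 swizzled SF layout): with block size 16, each activation row carries K/16 FP8 scale factors, and the swizzled SF tensor is padded up to tile boundaries, so at M=1 the 8x4 layout materializes roughly 16x fewer SF elements:

```python
def padded_sf_elems(m: int, k: int, row_tile: int,
                    col_tile: int = 4, block_size: int = 16) -> int:
    """FP8 scale-factor count after padding M and K/block_size up to the
    swizzle tile boundaries (assumed layout, for illustration only)."""
    rows = -(-m // row_tile) * row_tile                  # ceil to row_tile
    cols = -(-(k // block_size) // col_tile) * col_tile  # ceil to col_tile
    return rows * cols

k = 4096
print(padded_sf_elems(1, k, row_tile=128))  # 32768 with a 128x4 tile
print(padded_sf_elems(1, k, row_tile=8))    # 2048 with an 8x4 tile: 16x fewer
```

By M=32 the two layouts pad to similar sizes, which is consistent with the crossover seen in the microbenchmarks.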

Written by Cursor Bugbot for commit 01a7e5d.


Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
@chatgpt-codex-connector: Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a new NVFP4 backend variant with a smaller 8x4 scaling-factor tiling layout, aimed at improving performance for small-batch decoding workloads. The changes are well-structured, touching upon the core quantization logic, environment variable definitions, and associated tests. I've identified a critical issue in compressed_tensors_w4a4_nvfp4.py where an undefined attribute is being used, which would lead to a runtime error. Additionally, there's a minor issue in flashinfer.py concerning an incorrect export. Addressing these points will solidify the implementation.

Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
@mgoin mgoin self-assigned this Dec 17, 2025
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review December 17, 2025 22:36
@chatgpt-codex-connector bot left a comment:

💡 Codex Review

P0: Review blocked by sandbox failure

I could not inspect commit 03c8db0ecb70fdf7fade1f70c9b17ace1a4b935d because every attempt to run shell commands in the workspace fails immediately with a linux-sandbox LandlockRestrict panic, leaving the repository inaccessible. Please rerun the review in an environment where exec access works so the diff can be analyzed.


vllm/envs.py Outdated
"flashinfer-cudnn",
"flashinfer-trtllm",
"flashinfer-cutlass",
"flashinfer-trtllm_8x4_sf_layout",
A Collaborator commented:
I wonder if we should enable this by default and let the autotuner pick the suitable tile size. I'm concerned it may cause unintended confusion to the users.

g_scale,
dtype,
block_size=16,
use_8x4_sf_layout=use_8x4_sf_layout,
@pavanimajety (Collaborator) commented Dec 17, 2025:
Perhaps make this an automated setting based on when 8x4_sf would be a better choice like A.shape[0] < 32 ?

@LopezCastroRoberto (Contributor, Author) replied Dec 19, 2025:
Yes, based on my benchmarks this is the right choice. I would also make this backend the default automatically in those cases.

LopezCastroRoberto and others added 2 commits December 19, 2025 10:27
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
mergify bot commented Jan 2, 2026:

Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

LopezCastroRoberto and others added 2 commits January 2, 2026 11:40
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
@mgoin (Member) left a comment:
LGTM as long as it works with torch.compile. Nice analysis!

Comment on lines +1368 to +1372
if self.backend == "flashinfer-trtllm" and x.shape[0] <= 32:
x_fp4, x_blockscale = flashinfer_quant_nvfp4_8x4_sf_layout(
x, layer.input_scale_inv
)
x_blockscale = x_blockscale.view(torch.float8_e4m3fn)
A Member commented:
Maybe we should put this logic inside of scaled_fp4_quant and pass in backend to that function

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 7, 2026
@mgoin mgoin added the performance Performance-related issues label Jan 7, 2026
LopezCastroRoberto and others added 2 commits January 8, 2026 13:16
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
@mergify mergify bot added the ci/build label Jan 9, 2026
mergify bot commented Jan 9, 2026 with the same pre-commit failure notice as above.

LopezCastroRoberto and others added 2 commits January 9, 2026 07:06
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
LopezCastroRoberto and others added 2 commits January 12, 2026 17:52
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@mgoin (Member) left a comment:
LGTM, nice work!

@vllm-bot vllm-bot merged commit 8ef50d9 into vllm-project:main Jan 13, 2026
61 of 64 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 13, 2026
This pull request was subsequently referenced by commits in sammysun0711/vllm (Jan 16, 2026), akh64bit/vllm (Jan 16, 2026), dsuhinin/vllm (Jan 21, 2026), and ItzDEXX/vllm (Feb 19, 2026).
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
Add comprehensive performance analysis for MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:

Architecture confirmed:
- Attention IS NVFP4 in this model (ignore list = only lm_head + MoE gates)
- 3 MTP modules present (layers 62-64) — biggest performance lever available
- Per-step weight load: ~6.15 GB → 36–44 tok/s theoretical ceiling on GB10

Performance gap analysis:
- Current: 24 tok/s on Strix Halo (AMD); GB10 expected similar baseline
- vLLM is 1.78x slower than SGLang at BS=1 for NVFP4 MoE (documented gap)
- Gap sources: activation quant overhead, kernel launch overhead, no fused
  shuffle+reduce in MoE, generic CUTLASS configs

Key new PRs to integrate:
- vllm-project#35041 (OPEN): MTP+NVFP4 weight shape mismatch — required for MTP+NVFP4
- vllm-project#35442 (OPEN): Non-blocking MTP token copy — 6ms→200µs CPU-GPU sync
- vllm-project#33303 (OPEN): MiniMax PP+DP for multi-Spark scaling

Already-merged PRs confirmed in HEAD:
- vllm-project#34718 (act_quant_fusion.py): SiLU+FP4 fusion
- vllm-project#34899 (allreduce_rms_fusion.py): NVFP4 AR+Norm fusion
- vllm-project#30885: 8x4 SF tiling (not yet effective on GB10 — TRTLLM backend blocked)
scottgl9 added a second commit to scottgl9/vllm referencing this pull request on Mar 4, 2026, with the same analysis.

Labels

ci/build, nvidia, performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done


4 participants