[Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding (#30885)
Conversation
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Code Review
This pull request introduces a new NVFP4 backend variant with a smaller 8x4 scaling-factor tiling layout, aimed at improving performance for small-batch decoding workloads. The changes are well-structured, touching upon the core quantization logic, environment variable definitions, and associated tests. I've identified a critical issue in compressed_tensors_w4a4_nvfp4.py where an undefined attribute is being used, which would lead to a runtime error. Additionally, there's a minor issue in flashinfer.py concerning an incorrect export. Addressing these points will solidify the implementation.
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
💡 Codex Review
Review blocked by sandbox failure
I could not inspect commit 03c8db0ecb70fdf7fade1f70c9b17ace1a4b935d because every attempt to run shell commands in the workspace fails immediately with a linux-sandbox LandlockRestrict panic, leaving the repository inaccessible. Please rerun the review in an environment where exec access works so the diff can be analyzed.
vllm/envs.py
    "flashinfer-cudnn",
    "flashinfer-trtllm",
    "flashinfer-cutlass",
    "flashinfer-trtllm_8x4_sf_layout",
I wonder if we should enable this by default and let the autotuner pick the suitable tile size. I'm concerned it may cause unintended confusion to the users.
    g_scale,
    dtype,
    block_size=16,
    use_8x4_sf_layout=use_8x4_sf_layout,
Perhaps make this an automated setting based on when 8x4_sf would be the better choice, e.g. `A.shape[0] < 32`?
Yes, based on my benchmarks this is the right choice. I would also make this backend the default automatically in those cases.
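The batch-size heuristic discussed in this thread can be sketched as a tiny dispatch helper. This is an illustrative sketch, not vLLM's actual API; the function name and threshold constant are hypothetical, with the threshold taken from the `A.shape[0] < 32` suggestion above.

```python
SMALL_BATCH_THRESHOLD = 32  # rows at or below this favor the 8x4 SF layout

def use_8x4_sf_layout(num_rows: int) -> bool:
    """Illustrative heuristic: prefer the smaller 8x4 scaling-factor
    tiling when the activation batch (row) dimension is small, i.e.
    the decode regime; larger batches keep the default swizzled layout."""
    return num_rows <= SMALL_BATCH_THRESHOLD

print(use_8x4_sf_layout(8))    # small decode batch -> True
print(use_8x4_sf_layout(256))  # prefill-sized batch -> False
```

The point of keeping the check in one helper is that the threshold can later be replaced by an autotuned value without touching call sites.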
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
mgoin left a comment:
LGTM as long as it works with torch.compile. Nice analysis!
    if self.backend == "flashinfer-trtllm" and x.shape[0] <= 32:
        x_fp4, x_blockscale = flashinfer_quant_nvfp4_8x4_sf_layout(
            x, layer.input_scale_inv
        )
        x_blockscale = x_blockscale.view(torch.float8_e4m3fn)
Maybe we should put this logic inside of `scaled_fp4_quant` and pass the backend in to that function.
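The refactor suggested here, moving the branch into a single quantization entry point keyed on the backend, could be sketched as follows. The function name and return values are hypothetical; the real vLLM function would return quantized tensors rather than a layout tag.

```python
def scaled_fp4_quant_dispatch(x_rows: int, backend: str) -> str:
    """Hypothetical sketch: decide the scaling-factor layout inside one
    quantization entry point instead of branching at every call site.
    Returns the layout tag the caller would use."""
    if backend == "flashinfer-trtllm" and x_rows <= 32:
        return "8x4_sf"    # small-batch decode path
    return "128x4_sf"      # default swizzled layout

print(scaled_fp4_quant_dispatch(8, "flashinfer-trtllm"))    # 8x4_sf
print(scaled_fp4_quant_dispatch(8, "flashinfer-cutlass"))   # 128x4_sf
```

Centralizing the decision keeps every quantization call site backend-agnostic, which also simplifies tracing under torch.compile.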
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
…small-batch decoding (vllm-project#30885) Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es> Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com> Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Add comprehensive performance analysis for MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:

Architecture confirmed:
- Attention IS NVFP4 in this model (ignore list = only lm_head + MoE gates)
- 3 MTP modules present (layers 62-64) — biggest performance lever available
- Per-step weight load: ~6.15 GB → 36–44 tok/s theoretical ceiling on GB10

Performance gap analysis:
- Current: 24 tok/s on Strix Halo (AMD); GB10 expected similar baseline
- vLLM is 1.78x slower than SGLang at BS=1 for NVFP4 MoE (documented gap)
- Gap sources: activation quant overhead, kernel launch overhead, no fused shuffle+reduce in MoE, generic CUTLASS configs

Key new PRs to integrate:
- vllm-project#35041 (OPEN): MTP+NVFP4 weight shape mismatch — required for MTP+NVFP4
- vllm-project#35442 (OPEN): Non-blocking MTP token copy — 6ms→200µs CPU-GPU sync
- vllm-project#33303 (OPEN): MiniMax PP+DP for multi-Spark scaling

Already-merged PRs confirmed in HEAD:
- vllm-project#34718 (act_quant_fusion.py): SiLU+FP4 fusion
- vllm-project#34899 (allreduce_rms_fusion.py): NVFP4 AR+Norm fusion
- vllm-project#30885: 8x4 SF tiling (not yet effective on GB10 — TRTLLM backend blocked)
Summary
This PR adds an opt-in NVFP4 backend variant that uses smaller scaling-factor tiling (8x4 SF layout). The change targets small-concurrency decode workloads and delivers ~25–35% higher output token throughput compared to the current best NVFP4 backend at small batch sizes.
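For context on what the scaling factors are: NVFP4 quantizes values in contiguous blocks of 16 elements (the `block_size=16` in this PR), each block sharing one scale so the 4-bit values stay within the FP4 (E2M1) representable range, whose maximum magnitude is 6.0. A minimal pure-Python sketch of per-block scale computation, illustrative only and not vLLM's kernel:

```python
FP4_MAX = 6.0     # max magnitude representable in FP4 (E2M1)
BLOCK_SIZE = 16   # elements sharing one scale factor

def block_scales(row, block_size=BLOCK_SIZE):
    """Compute one scale per block of 16 values so that the scaled
    values fit the FP4 range. Illustrative sketch only."""
    scales = []
    for i in range(0, len(row), block_size):
        block = row[i:i + block_size]
        amax = max(abs(v) for v in block)
        scales.append(amax / FP4_MAX if amax > 0 else 1.0)
    return scales

row = [0.5] * 16 + [3.0] * 16  # two blocks with different dynamic range
print(block_scales(row))        # one scale per 16-element block
```

These per-block scales form a small side matrix, and how that matrix is tiled in memory (128x4 swizzled vs. the new 8x4 layout) is exactly what this PR makes configurable.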
The backend is automatically selected when:
- `VLLM_NVFP4_GEMM_BACKEND=flashinfer-trtllm` is exported, and
- batch size <= 32.

Microbenchmark results

Quantization overhead is included for the FP4 numbers.
Notes
Example (10240x8192 weights, `nvidia/Llama-3.3-70B-Instruct-NVFP4`):

Additional examples:

Model: `meta-llama/Llama-3.1-8B-Instruct`
Metric: TFLOP/s
Problem size: N = 28672, K = 4096

Model: `meta-llama/Llama-3.3-70B-Instruct`
Metric: TFLOP/s
Problem size: N = 8192, K = 8192

`trtllm_8x4_sf-nvfp4` is consistently best at batch ≤ 16, and sometimes at `bs=32`, which aligns with the target decode regime.

Preliminary Results
Setup

- `--max-concurrency=1`

Small-batch Decode Throughput (FP4 + FP8)

Model: `nvidia/Llama-3.3-70B-Instruct`

Model: `nvidia/Llama-3.1-8B-Instruct`

Relative to the current best NVFP4 baseline (`flashinfer-cutlass`):

Relative to the current FP8 baseline:

Full logs for `flashinfer-trtllm_8x4_sf` below:
- `nvidia/Llama-3.3-70B-Instruct-NVFP4`
- `nvidia/Llama-3.1-8B-Instruct-NVFP4`

Note
Optimizes NVFP4 small-batch decode via a smaller scaling-factor tiling and integrates it end-to-end.

- `mm_fp4` with `use_8x4_sf_layout`; add custom op `vllm::flashinfer_nvfp4_quantize` and helper `flashinfer_quant_nvfp4_8x4_sf_layout`; `flashinfer_scaled_fp4_mm` auto-enables 8x4 SF for `trtllm` when rows ≤ 32
- `compressed_tensors_w4a4_nvfp4.py` (and modelopt) auto-switch to 8x4 SF quantization for `flashinfer-trtllm` small inputs; otherwise use `scaled_fp4_quant`
- `convert_swizzled_8x4_layout_to_linear` and a layout flag to `dequantize_nvfp4_to_dtype`

Written by Cursor Bugbot for commit 01a7e5d.
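The `convert_swizzled_8x4_layout_to_linear` helper mentioned above can be pictured with a generic tile-untiling sketch. The exact swizzle used by the TRTLLM 8x4 SF layout is a kernel implementation detail not reproduced here; this only illustrates the general mapping from tile-by-tile storage back to plain row-major order, with assumed row-major 8x4 tiles.

```python
TILE_R, TILE_C = 8, 4  # the smaller SF tile shape introduced by this PR

def tiled_to_linear(flat, rows, cols):
    """flat: scale values laid out tile by tile (each 8x4 tile row-major,
    tiles traversed left-to-right, top-to-bottom). Returns a rows x cols
    list-of-lists in linear (row, col) order. Illustrative only."""
    out = [[0.0] * cols for _ in range(rows)]
    tiles_per_row = cols // TILE_C
    for idx, v in enumerate(flat):
        tile, within = divmod(idx, TILE_R * TILE_C)
        tr, tc = divmod(tile, tiles_per_row)   # which tile
        r, c = divmod(within, TILE_C)          # position inside the tile
        out[tr * TILE_R + r][tc * TILE_C + c] = v
    return out

# round-trip check on an 8x8 matrix (two 8x4 tiles side by side)
lin = tiled_to_linear(list(range(64)), 8, 8)
print(lin[0])  # first linear row mixes the first rows of both tiles
```

The dequantization path needs such a conversion because test and reference code index scales linearly, while the GEMM kernel consumes them in tiled order.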