[Kernel][Performance] Add FlashInfer cutedsl NVFP4 GEMM backend#42235

Open
mmangkad wants to merge 1 commit into
vllm-project:mainfrom
mmangkad:add-flashinfer-cutedsl-nvfp4

@mmangkad
Contributor

Summary

Adds flashinfer-cutedsl for dense NVFP4 GEMM and makes it the highest-priority CUDA backend when supported on SM10x. In serving benchmarks, cutedsl is fastest across concurrency 1-512 and improves tok/s/user by up to 27.07% over the tested FlashInfer backends.
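To illustrate what "highest-priority backend when supported" means in practice, here is a minimal sketch of priority-ordered selection. It is not the vLLM implementation: the `Backend` class, the `select_backend` helper, and the support predicates for the non-cutedsl entries are hypothetical placeholders; only the backend names and the SM10x condition for cutedsl come from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (10, 3) for SM103


@dataclass(frozen=True)
class Backend:
    name: str
    is_supported: Callable[[Capability], bool]


def _is_sm10x(cap: Capability) -> bool:
    # SM100, SM103, ... all share major version 10.
    return cap[0] == 10


# Hypothetical priority list, highest priority first.
# The predicates for the last two entries are placeholders, not real rules.
BACKENDS: Sequence[Backend] = (
    Backend("flashinfer-cutedsl", _is_sm10x),
    Backend("flashinfer-cutlass", lambda cap: cap[0] >= 10),
    Backend("cutlass", lambda cap: True),
)


def select_backend(cap: Capability, forced: Optional[str] = None) -> str:
    """Return the forced backend if given, else the first supported one."""
    if forced is not None:
        for backend in BACKENDS:
            if backend.name == forced:
                if not backend.is_supported(cap):
                    raise ValueError(f"{forced} is not supported on SM{cap[0]}{cap[1]}")
                return backend.name
        raise ValueError(f"unknown backend: {forced}")
    for backend in BACKENDS:
        if backend.is_supported(cap):
            return backend.name
    raise RuntimeError("no NVFP4 GEMM backend available")
```

With this ordering, `select_backend((10, 3))` picks `flashinfer-cutedsl`, while an explicit `forced=` argument (standing in for the environment-variable override) bypasses the priority order.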

Performance Comparison

Setup:

  • Model: nvidia/Llama-3.1-8B-Instruct-NVFP4
  • Device: SM103
  • Dataset: random
  • Input/output length: 512 input tokens, 512 output tokens
image

Test Plan

CI, which now covers:

  • The FlashInfer NVFP4 GEMM kernel test, extended with cute-dsl.
  • The NVFP4 model test, extended with flashinfer-cutedsl.


@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new NVFP4 GEMM backend utilizing FlashInfer's CuteDSL, specifically targeting SM10x architectures. The changes include the implementation of the FlashInferCuteDslNvFp4LinearKernel, its registration within the kernel executor, and the addition of flashinfer-cutedsl as a valid environment variable option. Feedback highlights inconsistencies in the backend naming convention, recommending the use of "cutedsl" instead of "cute-dsl" across the codebase and tests for better alignment with existing backend identifiers.

Comment thread vllm/model_executor/kernels/linear/nvfp4/flashinfer.py
Comment thread tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py Outdated
Comment thread tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
Contributor

@LopezCastroRoberto LopezCastroRoberto left a comment

I already started this integration some time ago in #39933, but I don’t think there’s a clear heuristic for deciding when this backend should be selected.

Across the different shapes I tested, the best results were typically in the range 16 <= bs <= 32. Outside of that range, this backend is not consistently the best option and can actually be significantly slower in some cases.

One example is:

Image

where speedup=1 means a different kernel of the existing ones was selected, but cuteDSL causes a regression

@mmangkad
Contributor Author

> I already started this integration some time ago in #39933, but I don’t think there’s a clear heuristic for deciding when this backend should be selected.
>
> Across the different shapes I tested, the best results were typically in the range 16 <= bs <= 32. Outside of that range, this backend is not consistently the best option and can actually be significantly slower in some cases.
>
> One example is:
>
> Image
>
> where speedup=1 means a different kernel of the existing ones was selected, but cuteDSL causes a regression

Could you clarify when these SM100 benchmarks were collected and which FlashInfer version was used? Based on my testing, cute-dsl is almost always better than CUTLASS and cuDNN on both SM103 and SM100, and we should actually expect to see a higher relative speedup on SM100 rather than the regressions shown here.

@LopezCastroRoberto
Contributor

@mmangkad Yeah, you have a point, this was with 0.6.8 and might have changed since then. I see the latest FI release is 0.6.11.

Can you please benchmark those shapes in my plot to see how different it looks now? I recommend using triton.testing.do_bench_cudagraph for proper time measurement, e.g., https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_nvfp4_gemm.py

@mmangkad
Contributor Author

@LopezCastroRoberto see below. I reran those shapes with triton.testing.do_bench_cudagraph and compared cute-dsl directly against FI CUTLASS, which is the current highest-priority NVFP4 backend before this PR. This uses the latest FI release on SM100 and SM103.

nvfp4_gemm_cutedsl_speedup_autotune

@LopezCastroRoberto
Contributor

LopezCastroRoberto commented May 14, 2026

> @LopezCastroRoberto see below. I reran those shapes with triton.testing.do_bench_cudagraph and compared cute-dsl directly against FI CUTLASS, which is the current highest-priority NVFP4 backend before this PR. This uses the latest FI release on SM100 and SM103.
>
> nvfp4_gemm_cutedsl_speedup_autotune

Thanks for the results, @mmangkad! Yeah, it seems it might have improved since the last time I checked. Just to make sure, would you mind adding the flashinfer-trtllm backend to the comparison, making sure use_8x4_sf_layout is True when calling flashinfer mm_fp4? We found that for bs<32 that backend was significantly faster than CUTLASS in most cases. See #30885

That way, we would have the full picture and it would be easier to define a heuristic, instead of just adding one more backend to the list.

@mmangkad
Contributor Author

@LopezCastroRoberto TRTLLM is still strongest at the very smallest M values, especially M=1-4, but CuTeDSL already matches or beats it in many small-M cases and takes over by M=8+. The clearer result is that CuTeDSL is almost always better than the current CUTLASS default across these shapes.

FlashInfer NVFP4 GEMM Results

Each backend cell is TFLOP/s (gap vs best). best is computed within the same device, shape, and M.
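For readers who want to reproduce the cell format, the sketch below shows one way to turn a measured kernel time into TFLOP/s and a "gap vs best" percentage. It assumes the usual 2·M·N·K FLOP count for a dense GEMM; the benchmark script behind these tables may count differently, and the function names are illustrative.

```python
from typing import Dict


def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Throughput of an (M x K) @ (K x N) GEMM, assuming 2*M*N*K FLOPs."""
    flops = 2.0 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12


def gap_vs_best(tflops: Dict[str, float]) -> Dict[str, float]:
    """Percentage gap of each backend relative to the fastest one (0.0 = best)."""
    best = max(tflops.values())
    return {name: (value / best - 1.0) * 100.0 for name, value in tflops.items()}


# First SM100 row below (N=7168, K=2048, M=1).
row = {"CuTeDSL": 7.75, "CUTLASS": 7.59, "TRTLLM": 8.87}
gaps = gap_vs_best(row)
```

Because the table percentages were presumably computed from unrounded throughputs, recomputing them from the rounded cells can differ in the second decimal place.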

Overall Winners

| Backend | Wins | Share |
|---|---|---|
| CuTeDSL | 73 | 65.2% |
| CUTLASS | 18 | 16.1% |
| TRTLLM | 21 | 18.8% |
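The win counts and shares can be recomputed from the per-row "Winner" column with a small tally. This is a sketch with a hypothetical helper name; the input below simply replays the overall counts from the summary (112 rows in total across both devices).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tally_winners(winners: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Map each backend to (number of wins, share of all rows in percent)."""
    counts = Counter(winners)
    total = sum(counts.values())
    return {name: (n, round(100.0 * n / total, 1)) for name, n in counts.items()}


# Replaying the overall summary: 73 + 18 + 21 = 112 rows.
winners = ["CuTeDSL"] * 73 + ["CUTLASS"] * 18 + ["TRTLLM"] * 21
summary = tally_winners(winners)
```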

SM100 winners

| Backend | Wins | Share |
|---|---|---|
| CuTeDSL | 37 | 66.1% |
| CUTLASS | 8 | 14.3% |
| TRTLLM | 11 | 19.6% |

SM103 winners

| Backend | Wins | Share |
|---|---|---|
| CuTeDSL | 36 | 64.3% |
| CUTLASS | 10 | 17.9% |
| TRTLLM | 10 | 17.9% |

SM100

N=7168, K=2048

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 7.75 (-12.61%) | 7.59 (-14.39%) | 8.87 (best) | TRTLLM |
| 2 | 15.96 (-4.57%) | 13.99 (-16.35%) | 16.72 (best) | TRTLLM |
| 4 | 30.94 (-6.65%) | 28.99 (-12.52%) | 33.14 (best) | TRTLLM |
| 8 | 72.41 (best) | 60.67 (-16.21%) | 70.36 (-2.83%) | CuTeDSL |
| 16 | 144.98 (best) | 120.89 (-16.62%) | 130.30 (-10.13%) | CuTeDSL |
| 32 | 290.41 (best) | 229.20 (-21.08%) | 254.27 (-12.44%) | CuTeDSL |
| 64 | 586.55 (best) | 460.02 (-21.57%) | 372.67 (-36.46%) | CuTeDSL |
| 128 | 1123.69 (best) | 957.68 (-14.77%) | 724.85 (-35.49%) | CuTeDSL |
| 256 | 1889.30 (best) | 1754.17 (-7.15%) | 1473.48 (-22.01%) | CuTeDSL |
| 512 | 2757.34 (best) | 2718.47 (-1.41%) | 1908.06 (-30.80%) | CuTeDSL |
| 1024 | 3355.52 (-2.28%) | 3433.64 (best) | 2471.99 (-28.01%) | CUTLASS |
| 2048 | 4351.24 (best) | 4296.43 (-1.26%) | 3198.24 (-26.50%) | CuTeDSL |
| 4096 | 4637.72 (-2.92%) | 4777.00 (best) | 3491.12 (-26.92%) | CUTLASS |
| 8192 | 4919.04 (best) | 4916.80 (-0.05%) | 3556.33 (-27.70%) | CuTeDSL |

N=4096, K=7168

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 9.31 (best) | 8.70 (-6.51%) | 9.22 (-1.03%) | CuTeDSL |
| 2 | 18.52 (best) | 17.75 (-4.13%) | 17.86 (-3.55%) | CuTeDSL |
| 4 | 37.14 (best) | 36.11 (-2.78%) | 35.13 (-5.41%) | CuTeDSL |
| 8 | 77.10 (best) | 69.76 (-9.52%) | 70.75 (-8.24%) | CuTeDSL |
| 16 | 154.35 (best) | 140.15 (-9.20%) | 141.89 (-8.07%) | CuTeDSL |
| 32 | 308.70 (best) | 280.22 (-9.23%) | 280.46 (-9.15%) | CuTeDSL |
| 64 | 619.04 (best) | 556.22 (-10.15%) | 398.77 (-35.58%) | CuTeDSL |
| 128 | 1221.97 (best) | 1097.78 (-10.16%) | 779.89 (-36.18%) | CuTeDSL |
| 256 | 2345.46 (best) | 2172.83 (-7.36%) | 1536.17 (-34.50%) | CuTeDSL |
| 512 | 3953.75 (best) | 3819.20 (-3.40%) | 2971.94 (-24.83%) | CuTeDSL |
| 1024 | 4827.23 (-0.60%) | 4856.49 (best) | 3265.48 (-32.76%) | CUTLASS |
| 2048 | 5664.96 (best) | 5582.28 (-1.46%) | 3647.17 (-35.62%) | CuTeDSL |
| 4096 | 5707.21 (-0.10%) | 5713.02 (best) | 4242.58 (-25.74%) | CUTLASS |
| 8192 | 6151.10 (best) | 5991.68 (-2.59%) | 3931.82 (-36.08%) | CuTeDSL |

N=18432, K=7168

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 23.81 (-31.82%) | 19.16 (-45.12%) | 34.91 (best) | TRTLLM |
| 2 | 48.89 (-30.44%) | 47.55 (-32.35%) | 70.29 (best) | TRTLLM |
| 4 | 99.59 (-28.17%) | 102.02 (-26.42%) | 138.66 (best) | TRTLLM |
| 8 | 232.06 (-19.13%) | 206.99 (-27.86%) | 286.94 (best) | TRTLLM |
| 16 | 462.80 (-16.96%) | 408.13 (-26.77%) | 557.33 (best) | TRTLLM |
| 32 | 929.33 (best) | 825.79 (-11.14%) | 553.18 (-40.48%) | CuTeDSL |
| 64 | 2254.95 (best) | 1712.88 (-24.04%) | 898.07 (-60.17%) | CuTeDSL |
| 128 | 3832.20 (best) | 3373.04 (-11.98%) | 3054.55 (-20.29%) | CuTeDSL |
| 256 | 5070.73 (best) | 4942.17 (-2.54%) | 3256.12 (-35.79%) | CuTeDSL |
| 512 | 5359.51 (best) | 5307.24 (-0.98%) | 3621.65 (-32.43%) | CuTeDSL |
| 1024 | 5506.23 (best) | 5430.92 (-1.37%) | 3673.10 (-33.29%) | CuTeDSL |
| 2048 | 5779.01 (best) | 5636.57 (-2.46%) | 3384.32 (-41.44%) | CuTeDSL |
| 4096 | 5466.57 (-4.88%) | 5746.94 (best) | 3473.93 (-39.55%) | CUTLASS |
| 8192 | 5681.96 (best) | 5483.52 (-3.49%) | 3490.44 (-38.57%) | CuTeDSL |

N=7168, K=18432

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 17.29 (-12.04%) | 15.99 (-18.66%) | 19.66 (best) | TRTLLM |
| 2 | 37.68 (-2.86%) | 31.97 (-17.58%) | 38.79 (best) | TRTLLM |
| 4 | 73.76 (-6.20%) | 63.85 (-18.79%) | 78.63 (best) | TRTLLM |
| 8 | 172.87 (best) | 128.22 (-25.83%) | 156.05 (-9.73%) | CuTeDSL |
| 16 | 345.76 (best) | 255.74 (-26.04%) | 310.93 (-10.07%) | CuTeDSL |
| 32 | 660.78 (best) | 477.45 (-27.74%) | 591.83 (-10.44%) | CuTeDSL |
| 64 | 1377.38 (best) | 1033.90 (-24.94%) | 862.33 (-37.39%) | CuTeDSL |
| 128 | 2582.43 (best) | 1897.42 (-26.53%) | 1702.98 (-34.05%) | CuTeDSL |
| 256 | 4570.02 (best) | 4432.55 (-3.01%) | 2802.01 (-38.69%) | CuTeDSL |
| 512 | 4834.91 (-0.37%) | 4852.64 (best) | 2930.76 (-39.60%) | CUTLASS |
| 1024 | 5372.21 (best) | 5212.10 (-2.98%) | 3034.59 (-43.51%) | CuTeDSL |
| 2048 | 5516.94 (best) | 5284.15 (-4.22%) | 3288.94 (-40.38%) | CuTeDSL |
| 4096 | 5572.54 (-3.29%) | 5762.37 (best) | 3350.80 (-41.85%) | CUTLASS |
| 8192 | 5002.21 (-8.12%) | 5444.40 (best) | 3707.88 (-31.90%) | CUTLASS |

SM103

N=7168, K=2048

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 8.26 (-11.02%) | 7.72 (-16.91%) | 9.29 (best) | TRTLLM |
| 2 | 16.80 (-2.49%) | 14.93 (-13.33%) | 17.23 (best) | TRTLLM |
| 4 | 32.54 (-7.83%) | 29.90 (-15.30%) | 35.30 (best) | TRTLLM |
| 8 | 75.32 (best) | 59.74 (-20.68%) | 69.53 (-7.69%) | CuTeDSL |
| 16 | 146.08 (best) | 120.41 (-17.58%) | 139.20 (-4.71%) | CuTeDSL |
| 32 | 304.86 (best) | 239.83 (-21.33%) | 257.04 (-15.68%) | CuTeDSL |
| 64 | 616.77 (best) | 479.07 (-22.33%) | 394.67 (-36.01%) | CuTeDSL |
| 128 | 1141.56 (best) | 963.42 (-15.60%) | 763.52 (-33.12%) | CuTeDSL |
| 256 | 2039.46 (best) | 1861.99 (-8.70%) | 1479.90 (-27.44%) | CuTeDSL |
| 512 | 2988.48 (best) | 2736.61 (-8.43%) | 1983.47 (-33.63%) | CuTeDSL |
| 1024 | 3553.24 (-2.43%) | 3641.57 (best) | 2563.44 (-29.61%) | CUTLASS |
| 2048 | 4550.94 (best) | 4392.35 (-3.48%) | 3245.54 (-28.68%) | CuTeDSL |
| 4096 | 4920.28 (best) | 4894.33 (-0.53%) | 3558.20 (-27.68%) | CuTeDSL |
| 8192 | 5148.52 (-1.64%) | 5234.12 (best) | 3413.34 (-34.79%) | CUTLASS |

N=4096, K=7168

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 9.03 (-3.06%) | 8.26 (-11.37%) | 9.31 (best) | TRTLLM |
| 2 | 19.18 (best) | 17.22 (-10.23%) | 18.43 (-3.89%) | CuTeDSL |
| 4 | 38.59 (best) | 37.81 (-2.03%) | 36.83 (-4.56%) | CuTeDSL |
| 8 | 81.02 (best) | 73.13 (-9.74%) | 73.80 (-8.91%) | CuTeDSL |
| 16 | 161.90 (best) | 146.44 (-9.55%) | 147.29 (-9.03%) | CuTeDSL |
| 32 | 324.05 (best) | 293.18 (-9.53%) | 292.25 (-9.81%) | CuTeDSL |
| 64 | 653.75 (best) | 584.33 (-10.62%) | 412.80 (-36.86%) | CuTeDSL |
| 128 | 1274.86 (best) | 1151.05 (-9.71%) | 807.53 (-36.66%) | CuTeDSL |
| 256 | 2487.46 (best) | 2320.52 (-6.71%) | 1598.98 (-35.72%) | CuTeDSL |
| 512 | 4331.05 (best) | 4087.08 (-5.63%) | 3075.78 (-28.98%) | CuTeDSL |
| 1024 | 5160.79 (best) | 5035.26 (-2.43%) | 3388.57 (-34.34%) | CuTeDSL |
| 2048 | 5865.40 (best) | 5856.70 (-0.15%) | 3677.14 (-37.31%) | CuTeDSL |
| 4096 | 6209.12 (-3.70%) | 6447.84 (best) | 4181.73 (-35.15%) | CUTLASS |
| 8192 | 5629.43 (best) | 5615.45 (-0.25%) | 3947.02 (-29.89%) | CuTeDSL |

N=18432, K=7168

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 25.02 (-33.65%) | 21.57 (-42.79%) | 37.71 (best) | TRTLLM |
| 2 | 50.75 (-31.76%) | 47.81 (-35.72%) | 74.37 (best) | TRTLLM |
| 4 | 102.43 (-31.62%) | 100.24 (-33.08%) | 149.80 (best) | TRTLLM |
| 8 | 246.51 (-19.50%) | 208.99 (-31.75%) | 306.21 (best) | TRTLLM |
| 16 | 483.16 (-18.65%) | 412.14 (-30.60%) | 593.90 (best) | TRTLLM |
| 32 | 963.08 (best) | 832.79 (-13.53%) | 585.98 (-39.16%) | CuTeDSL |
| 64 | 2388.24 (best) | 1765.84 (-26.06%) | 938.92 (-60.69%) | CuTeDSL |
| 128 | 4138.86 (best) | 3577.81 (-13.56%) | 3203.88 (-22.59%) | CuTeDSL |
| 256 | 5331.70 (best) | 5207.50 (-2.33%) | 3343.82 (-37.28%) | CuTeDSL |
| 512 | 5786.16 (best) | 5264.24 (-9.02%) | 3680.79 (-36.39%) | CuTeDSL |
| 1024 | 6007.93 (best) | 5940.19 (-1.13%) | 3569.01 (-40.59%) | CuTeDSL |
| 2048 | 5535.25 (best) | 5356.93 (-3.22%) | 3455.75 (-37.57%) | CuTeDSL |
| 4096 | 5276.37 (-1.82%) | 5374.20 (best) | 3551.15 (-33.92%) | CUTLASS |
| 8192 | 5346.58 (-4.00%) | 5569.13 (best) | 3568.93 (-35.92%) | CUTLASS |

N=7168, K=18432

| M | CuTeDSL | CUTLASS | TRTLLM | Winner |
|---|---|---|---|---|
| 1 | 17.97 (-11.04%) | 16.87 (-16.48%) | 20.20 (best) | TRTLLM |
| 2 | 41.76 (best) | 33.07 (-20.81%) | 40.28 (-3.53%) | CuTeDSL |
| 4 | 82.51 (best) | 65.98 (-20.02%) | 79.96 (-3.08%) | CuTeDSL |
| 8 | 177.90 (best) | 132.71 (-25.40%) | 161.40 (-9.28%) | CuTeDSL |
| 16 | 353.05 (best) | 280.15 (-20.65%) | 317.09 (-10.19%) | CuTeDSL |
| 32 | 704.78 (best) | 551.51 (-21.75%) | 621.58 (-11.80%) | CuTeDSL |
| 64 | 1368.03 (best) | 1074.71 (-21.44%) | 779.15 (-43.05%) | CuTeDSL |
| 128 | 2654.48 (best) | 2082.84 (-21.53%) | 1739.80 (-34.46%) | CuTeDSL |
| 256 | 4591.57 (best) | 4442.42 (-3.25%) | 2904.12 (-36.75%) | CuTeDSL |
| 512 | 5248.23 (-0.15%) | 5255.86 (best) | 3039.43 (-42.17%) | CUTLASS |
| 1024 | 5598.81 (-6.27%) | 5973.28 (best) | 2949.47 (-50.62%) | CUTLASS |
| 2048 | 5114.07 (-8.83%) | 5609.35 (best) | 3206.56 (-42.84%) | CUTLASS |
| 4096 | 5386.57 (-8.61%) | 5893.87 (best) | 3284.85 (-44.27%) | CUTLASS |
| 8192 | 4975.49 (-14.80%) | 5839.75 (best) | 3264.02 (-44.11%) | CUTLASS |
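One way to turn these results into the kind of heuristic discussed earlier in the thread is an M-threshold rule: TRTLLM at the very smallest M, CuTeDSL otherwise. The sketch below is purely illustrative and is not what this PR implements; `pick_backend`, the default threshold of 4, and `worst_gap_pct` are hypothetical, and the sample rows are copied from the SM100 N=7168, K=2048 table.

```python
from typing import Dict


def pick_backend(m: int, small_m_threshold: int = 4) -> str:
    """Hypothetical rule: TRTLLM for tiny M, CuTeDSL for everything else."""
    return "TRTLLM" if m <= small_m_threshold else "CuTeDSL"


# M -> TFLOP/s per backend (SM100, N=7168, K=2048; subset of the rows above).
SAMPLE_ROWS: Dict[int, Dict[str, float]] = {
    1: {"CuTeDSL": 7.75, "CUTLASS": 7.59, "TRTLLM": 8.87},
    4: {"CuTeDSL": 30.94, "CUTLASS": 28.99, "TRTLLM": 33.14},
    8: {"CuTeDSL": 72.41, "CUTLASS": 60.67, "TRTLLM": 70.36},
    64: {"CuTeDSL": 586.55, "CUTLASS": 460.02, "TRTLLM": 372.67},
    1024: {"CuTeDSL": 3355.52, "CUTLASS": 3433.64, "TRTLLM": 2471.99},
}


def worst_gap_pct(rows: Dict[int, Dict[str, float]], threshold: int) -> float:
    """Largest loss (in percent, <= 0) of the rule versus the per-row best."""
    worst = 0.0
    for m, perf in rows.items():
        chosen = perf[pick_backend(m, threshold)]
        best = max(perf.values())
        worst = min(worst, (chosen / best - 1.0) * 100.0)
    return worst


regret = worst_gap_pct(SAMPLE_ROWS, threshold=4)
```

On this sample the only loss is at M=1024, where CUTLASS is marginally ahead. A fixed threshold would not fit every shape, though: for N=18432, K=7168, TRTLLM stays ahead up to M=16, so a real heuristic would likely need to depend on (N, K) as well as M.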

@LopezCastroRoberto
Contributor

LopezCastroRoberto commented May 18, 2026

Thanks for the results, @mmangkad. Yeah, I think this makes sense.

We should also update the FI version to the latest, i.e. 0.6.11.post3?
https://github.com/LopezCastroRoberto/vllm/blob/main/requirements/cuda.txt

@LopezCastroRoberto
Contributor

cc: @mgoin

@mmangkad
Contributor Author

mmangkad commented May 18, 2026

> Thanks for the results, @mmangkad. Yeah, I think this makes sense.
>
> We should also update the FI version to the latest, i.e. 0.6.11.post3? https://github.com/LopezCastroRoberto/vllm/blob/main/requirements/cuda.txt

@LopezCastroRoberto we are already at 0.6.11.post2 now, but I think we can include the upgrade to 0.6.11.post3 here if you prefer.

@LopezCastroRoberto
Contributor

LopezCastroRoberto commented May 18, 2026

Never mind, my bad. I accidentally checked my own fork instead of upstream. Waiting for @mgoin's approval.

@mmangkad mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch 2 times, most recently from 3b67dbc to d2c176d Compare May 18, 2026 11:41
@mmangkad mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch from d2c176d to 3700b17 Compare May 18, 2026 11:44
@LopezCastroRoberto
Contributor

Worth being aware of, BTW: flashinfer-ai/flashinfer#3295

@mmangkad
Contributor Author

Rebased after resolving conflicts caused by #39538 and aligning with its changes

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
@mmangkad mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch from 3700b17 to c34ce76 Compare May 23, 2026 04:31
@LopezCastroRoberto
Contributor

LopezCastroRoberto commented May 28, 2026

@mmangkad -- following up on the FlashInfer autotuning issue I flagged earlier (flashinfer-ai/flashinfer#3295). The discussion has progressed and there's now a concrete fix, so I wanted to share the conclusions since they directly affect this PR.

Right now vLLM defaults to O2, which has enable_flashinfer_autotune=True (re-enabled in #42857). Once this PR lands making cute-dsl the highest-priority NVFP4 backend, every default NVFP4 deployment will autotune mm_fp4 cuteDSL kernels at startup.

Interestingly, it seems that autotuning mm_fp4 cuteDSL is unnecessary. PR flashinfer-ai/flashinfer#2940 added a heuristic that closes the perf gap between autotuned and non-autotuned runs. The heuristic predicts the best config for each (N, K) combination in <100us on the first call, and subsequent lookups take <0.2us.
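The "slow first call, near-free later lookups" behaviour is what a memoized per-(N, K) prediction looks like. The sketch below shows that pattern only; it is an assumption about the mechanism, not FlashInfer's code, and `predict_config` with its tile-size rule is a made-up placeholder.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def predict_config(n: int, k: int) -> Tuple[int, int]:
    """Placeholder heuristic returning a made-up (tile_n, tile_k) choice.

    The rule below is invented for illustration; only the caching pattern
    (compute once per (N, K), then look up) is the point of this sketch.
    """
    tile_n = 256 if n >= 8192 else 128
    tile_k = 256 if k >= 8192 else 128
    return (tile_n, tile_k)


predict_config.cache_clear()
first = predict_config(7168, 2048)   # computed on the first call
second = predict_config(7168, 2048)  # served from the cache
info = predict_config.cache_info()
```

Since each linear layer in a model has a fixed (N, K), the cache stays small and every call after the first is a dictionary lookup, which is consistent with skipping autotuning for this op.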

To fix this, flashinfer-ai/flashinfer#3396 adds a skip_ops mechanism:

with flashinfer.autotune(skip_ops="fp4_gemm"):
    ...

This brought warmup from 587s → 8s on DSV3.2-NVFP4 TP=4.

I think we should track a follow-up to integrate skip_ops when the FI version is bumped to include #3396, and then merge this PR too.

cc: @mgoin
