[Kernel][Performance] Add FlashInfer cutedsl NVFP4 GEMM backend by mmangkad · Pull Request #42235 · vllm-project/vllm

mmangkad · 2026-05-10T15:27:17Z

Summary

Adds flashinfer-cutedsl for dense NVFP4 GEMM and makes it the highest-priority CUDA backend when supported on SM10x. In serving benchmarks, cutedsl is fastest across concurrency 1-512 and improves tok/s/user by up to 27.07% over the tested FlashInfer backends.

Performance Comparison

Setup:

Model: nvidia/Llama-3.1-8B-Instruct-NVFP4
Device: SM103
Dataset: random
Input/output length: 512 input tokens, 512 output tokens

Test Plan

CI, which now covers:

The FlashInfer NVFP4 GEMM kernel test, extended with cute-dsl.
The NVFP4 model test, extended with flashinfer-cutedsl.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request introduces a new NVFP4 GEMM backend utilizing FlashInfer's CuteDSL, specifically targeting SM10x architectures. The changes include the implementation of the FlashInferCuteDslNvFp4LinearKernel, its registration within the kernel executor, and the addition of flashinfer-cutedsl as a valid environment variable option. Feedback highlights inconsistencies in the backend naming convention, recommending the use of "cutedsl" instead of "cute-dsl" across the codebase and tests for better alignment with existing backend identifiers.

LopezCastroRoberto

I already started this integration some time ago in #39933, but I don’t think there’s a clear heuristic for deciding when this backend should be selected.

Across the different shapes I tested, the best results were typically in the range 16 <= bs <= 32. Outside of that range, this backend is not consistently the best option and can actually be significantly slower in some cases.

One example is the attached plot (image not preserved in this text), where speedup=1 means that a different, already existing kernel was selected, but cuteDSL causes a regression.

mmangkad · 2026-05-13T16:37:24Z

Could you clarify when these SM100 benchmarks were collected and which FlashInfer version was used? Based on my testing, cute-dsl is almost always better than CUTLASS and cuDNN on both SM103 and SM100, and we should actually expect to see a higher relative speedup on SM100 rather than the regressions shown here.

LopezCastroRoberto · 2026-05-13T16:48:22Z

@mmangkad Yeah, you have a point, this was with 0.6.8 and might have changed since then. I see the latest FI release is 0.6.11.

Can you please benchmark those shapes in my plot to see how different it looks now? I recommend using triton.testing.do_bench_cudagraph for proper time measurement, e.g., https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_nvfp4_gemm.py
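
For reference, a minimal sketch of that measurement approach: time a callable with `triton.testing.do_bench_cudagraph` and convert the latency to TFLOP/s, counting 2·M·N·K FLOPs per GEMM. The helper names `gemm_tflops` and `bench_tflops` are illustrative assumptions, and the linked benchmark script may report its numbers differently.

```python
def gemm_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """Throughput of an (m x k) @ (k x n) GEMM, counting 2*m*n*k FLOPs."""
    flops = 2.0 * m * n * k
    return flops / (latency_ms * 1e-3) / 1e12


def bench_tflops(fn, m: int, n: int, k: int) -> float:
    """Time `fn` under CUDA graphs and report TFLOP/s (needs a CUDA GPU and Triton)."""
    # Imported lazily so the pure conversion above works without a GPU.
    from triton.testing import do_bench_cudagraph

    latency_ms = do_bench_cudagraph(fn)  # runtime in milliseconds
    return gemm_tflops(m, n, k, latency_ms)
```

Here `fn` would be a zero-argument closure that launches the GEMM under test, for example a lambda wrapping the backend call with pre-quantized inputs.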

mmangkad · 2026-05-13T18:35:13Z

@LopezCastroRoberto see below. I reran those shapes with triton.testing.do_bench_cudagraph and compared cute-dsl directly against FI CUTLASS, which is the current highest-priority NVFP4 backend before this PR. This uses the latest FI release on SM100 and SM103.

LopezCastroRoberto · 2026-05-14T13:34:30Z

Thanks for the results, @mmangkad! Yeah, it seems it might have improved since the last time I checked. Just to make sure, would you mind adding the flashinfer-trtllm backend to the comparison, making sure use_8x4_sf_layout is True when calling FlashInfer's mm_fp4? We found that for bs<32 that backend was significantly faster than CUTLASS in most cases. See #30885

That way, we would have the full picture and it would be easier to define a heuristic, instead of just adding one more backend to the list.

mmangkad · 2026-05-14T14:07:16Z

@LopezCastroRoberto TRTLLM is still strongest at the very smallest M values, especially M=1-4, but CuTeDSL already matches or beats it in many small-M cases and takes over by M=8+. The clearer result is that CuTeDSL is almost always better than the current CUTLASS default across these shapes.

FlashInfer NVFP4 GEMM Results

Each backend cell shows TFLOP/s, followed in parentheses by the gap to the best backend. "best" is computed within the same device, shape, and M.
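
As a small illustration of how such a cell can be derived, the sketch below computes the winner and the per-backend gap for one row. The function name `gaps_vs_best` is an assumption for illustration. The sample row is the SM100, N=7168, K=2048, M=8 entry from the tables in this comment.

```python
def gaps_vs_best(tflops: dict) -> tuple:
    """Return (winner, {backend: gap in percent relative to the winner})."""
    winner = max(tflops, key=tflops.get)
    best = tflops[winner]
    gaps = {name: 100.0 * (value - best) / best for name, value in tflops.items()}
    return winner, gaps


# SM100, N=7168, K=2048, M=8 (values in TFLOP/s, taken from the table)
row = {"CuTeDSL": 72.41, "CUTLASS": 60.67, "TRTLLM": 70.36}
winner, gaps = gaps_vs_best(row)
for name, gap in gaps.items():
    print(f"{name}: {row[name]:.2f} ({'best' if name == winner else f'{gap:.2f}%'})")
```

Gaps recomputed from the rounded table values can differ in the last digit from the published ones, which were presumably computed from unrounded measurements.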

Overall Winners

Backend	Wins	Share
CuTeDSL	73	65.2%
CUTLASS	18	16.1%
TRTLLM	21	18.8%

SM100 winners

Backend	Wins	Share
CuTeDSL	37	66.1%
CUTLASS	8	14.3%
TRTLLM	11	19.6%

SM103 winners

Backend	Wins	Share
CuTeDSL	36	64.3%
CUTLASS	10	17.9%
TRTLLM	10	17.9%
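
The win counts and shares above can be reproduced from the per-row Winner column with a simple tally. This is a sketch with an assumed helper name `win_shares`. The input below encodes only the SM100 totals rather than the individual rows.

```python
from collections import Counter


def win_shares(winners: list) -> dict:
    """Map each backend to (number of wins, share of all rows in percent)."""
    counts = Counter(winners)
    total = sum(counts.values())
    return {name: (wins, round(100.0 * wins / total, 1)) for name, wins in counts.items()}


# SM100: 56 rows in total (4 shapes x 14 values of M)
sm100_winners = ["CuTeDSL"] * 37 + ["CUTLASS"] * 8 + ["TRTLLM"] * 11
print(win_shares(sm100_winners))
```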

SM100

N=7168, K=2048

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	7.75 (-12.61%)	7.59 (-14.39%)	8.87 (best)	TRTLLM
2	15.96 (-4.57%)	13.99 (-16.35%)	16.72 (best)	TRTLLM
4	30.94 (-6.65%)	28.99 (-12.52%)	33.14 (best)	TRTLLM
8	72.41 (best)	60.67 (-16.21%)	70.36 (-2.83%)	CuTeDSL
16	144.98 (best)	120.89 (-16.62%)	130.30 (-10.13%)	CuTeDSL
32	290.41 (best)	229.20 (-21.08%)	254.27 (-12.44%)	CuTeDSL
64	586.55 (best)	460.02 (-21.57%)	372.67 (-36.46%)	CuTeDSL
128	1123.69 (best)	957.68 (-14.77%)	724.85 (-35.49%)	CuTeDSL
256	1889.30 (best)	1754.17 (-7.15%)	1473.48 (-22.01%)	CuTeDSL
512	2757.34 (best)	2718.47 (-1.41%)	1908.06 (-30.80%)	CuTeDSL
1024	3355.52 (-2.28%)	3433.64 (best)	2471.99 (-28.01%)	CUTLASS
2048	4351.24 (best)	4296.43 (-1.26%)	3198.24 (-26.50%)	CuTeDSL
4096	4637.72 (-2.92%)	4777.00 (best)	3491.12 (-26.92%)	CUTLASS
8192	4919.04 (best)	4916.80 (-0.05%)	3556.33 (-27.70%)	CuTeDSL

N=4096, K=7168

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	9.31 (best)	8.70 (-6.51%)	9.22 (-1.03%)	CuTeDSL
2	18.52 (best)	17.75 (-4.13%)	17.86 (-3.55%)	CuTeDSL
4	37.14 (best)	36.11 (-2.78%)	35.13 (-5.41%)	CuTeDSL
8	77.10 (best)	69.76 (-9.52%)	70.75 (-8.24%)	CuTeDSL
16	154.35 (best)	140.15 (-9.20%)	141.89 (-8.07%)	CuTeDSL
32	308.70 (best)	280.22 (-9.23%)	280.46 (-9.15%)	CuTeDSL
64	619.04 (best)	556.22 (-10.15%)	398.77 (-35.58%)	CuTeDSL
128	1221.97 (best)	1097.78 (-10.16%)	779.89 (-36.18%)	CuTeDSL
256	2345.46 (best)	2172.83 (-7.36%)	1536.17 (-34.50%)	CuTeDSL
512	3953.75 (best)	3819.20 (-3.40%)	2971.94 (-24.83%)	CuTeDSL
1024	4827.23 (-0.60%)	4856.49 (best)	3265.48 (-32.76%)	CUTLASS
2048	5664.96 (best)	5582.28 (-1.46%)	3647.17 (-35.62%)	CuTeDSL
4096	5707.21 (-0.10%)	5713.02 (best)	4242.58 (-25.74%)	CUTLASS
8192	6151.10 (best)	5991.68 (-2.59%)	3931.82 (-36.08%)	CuTeDSL

N=18432, K=7168

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	23.81 (-31.82%)	19.16 (-45.12%)	34.91 (best)	TRTLLM
2	48.89 (-30.44%)	47.55 (-32.35%)	70.29 (best)	TRTLLM
4	99.59 (-28.17%)	102.02 (-26.42%)	138.66 (best)	TRTLLM
8	232.06 (-19.13%)	206.99 (-27.86%)	286.94 (best)	TRTLLM
16	462.80 (-16.96%)	408.13 (-26.77%)	557.33 (best)	TRTLLM
32	929.33 (best)	825.79 (-11.14%)	553.18 (-40.48%)	CuTeDSL
64	2254.95 (best)	1712.88 (-24.04%)	898.07 (-60.17%)	CuTeDSL
128	3832.20 (best)	3373.04 (-11.98%)	3054.55 (-20.29%)	CuTeDSL
256	5070.73 (best)	4942.17 (-2.54%)	3256.12 (-35.79%)	CuTeDSL
512	5359.51 (best)	5307.24 (-0.98%)	3621.65 (-32.43%)	CuTeDSL
1024	5506.23 (best)	5430.92 (-1.37%)	3673.10 (-33.29%)	CuTeDSL
2048	5779.01 (best)	5636.57 (-2.46%)	3384.32 (-41.44%)	CuTeDSL
4096	5466.57 (-4.88%)	5746.94 (best)	3473.93 (-39.55%)	CUTLASS
8192	5681.96 (best)	5483.52 (-3.49%)	3490.44 (-38.57%)	CuTeDSL

N=7168, K=18432

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	17.29 (-12.04%)	15.99 (-18.66%)	19.66 (best)	TRTLLM
2	37.68 (-2.86%)	31.97 (-17.58%)	38.79 (best)	TRTLLM
4	73.76 (-6.20%)	63.85 (-18.79%)	78.63 (best)	TRTLLM
8	172.87 (best)	128.22 (-25.83%)	156.05 (-9.73%)	CuTeDSL
16	345.76 (best)	255.74 (-26.04%)	310.93 (-10.07%)	CuTeDSL
32	660.78 (best)	477.45 (-27.74%)	591.83 (-10.44%)	CuTeDSL
64	1377.38 (best)	1033.90 (-24.94%)	862.33 (-37.39%)	CuTeDSL
128	2582.43 (best)	1897.42 (-26.53%)	1702.98 (-34.05%)	CuTeDSL
256	4570.02 (best)	4432.55 (-3.01%)	2802.01 (-38.69%)	CuTeDSL
512	4834.91 (-0.37%)	4852.64 (best)	2930.76 (-39.60%)	CUTLASS
1024	5372.21 (best)	5212.10 (-2.98%)	3034.59 (-43.51%)	CuTeDSL
2048	5516.94 (best)	5284.15 (-4.22%)	3288.94 (-40.38%)	CuTeDSL
4096	5572.54 (-3.29%)	5762.37 (best)	3350.80 (-41.85%)	CUTLASS
8192	5002.21 (-8.12%)	5444.40 (best)	3707.88 (-31.90%)	CUTLASS

SM103

N=7168, K=2048

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	8.26 (-11.02%)	7.72 (-16.91%)	9.29 (best)	TRTLLM
2	16.80 (-2.49%)	14.93 (-13.33%)	17.23 (best)	TRTLLM
4	32.54 (-7.83%)	29.90 (-15.30%)	35.30 (best)	TRTLLM
8	75.32 (best)	59.74 (-20.68%)	69.53 (-7.69%)	CuTeDSL
16	146.08 (best)	120.41 (-17.58%)	139.20 (-4.71%)	CuTeDSL
32	304.86 (best)	239.83 (-21.33%)	257.04 (-15.68%)	CuTeDSL
64	616.77 (best)	479.07 (-22.33%)	394.67 (-36.01%)	CuTeDSL
128	1141.56 (best)	963.42 (-15.60%)	763.52 (-33.12%)	CuTeDSL
256	2039.46 (best)	1861.99 (-8.70%)	1479.90 (-27.44%)	CuTeDSL
512	2988.48 (best)	2736.61 (-8.43%)	1983.47 (-33.63%)	CuTeDSL
1024	3553.24 (-2.43%)	3641.57 (best)	2563.44 (-29.61%)	CUTLASS
2048	4550.94 (best)	4392.35 (-3.48%)	3245.54 (-28.68%)	CuTeDSL
4096	4920.28 (best)	4894.33 (-0.53%)	3558.20 (-27.68%)	CuTeDSL
8192	5148.52 (-1.64%)	5234.12 (best)	3413.34 (-34.79%)	CUTLASS

N=4096, K=7168

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	9.03 (-3.06%)	8.26 (-11.37%)	9.31 (best)	TRTLLM
2	19.18 (best)	17.22 (-10.23%)	18.43 (-3.89%)	CuTeDSL
4	38.59 (best)	37.81 (-2.03%)	36.83 (-4.56%)	CuTeDSL
8	81.02 (best)	73.13 (-9.74%)	73.80 (-8.91%)	CuTeDSL
16	161.90 (best)	146.44 (-9.55%)	147.29 (-9.03%)	CuTeDSL
32	324.05 (best)	293.18 (-9.53%)	292.25 (-9.81%)	CuTeDSL
64	653.75 (best)	584.33 (-10.62%)	412.80 (-36.86%)	CuTeDSL
128	1274.86 (best)	1151.05 (-9.71%)	807.53 (-36.66%)	CuTeDSL
256	2487.46 (best)	2320.52 (-6.71%)	1598.98 (-35.72%)	CuTeDSL
512	4331.05 (best)	4087.08 (-5.63%)	3075.78 (-28.98%)	CuTeDSL
1024	5160.79 (best)	5035.26 (-2.43%)	3388.57 (-34.34%)	CuTeDSL
2048	5865.40 (best)	5856.70 (-0.15%)	3677.14 (-37.31%)	CuTeDSL
4096	6209.12 (-3.70%)	6447.84 (best)	4181.73 (-35.15%)	CUTLASS
8192	5629.43 (best)	5615.45 (-0.25%)	3947.02 (-29.89%)	CuTeDSL

N=18432, K=7168

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	25.02 (-33.65%)	21.57 (-42.79%)	37.71 (best)	TRTLLM
2	50.75 (-31.76%)	47.81 (-35.72%)	74.37 (best)	TRTLLM
4	102.43 (-31.62%)	100.24 (-33.08%)	149.80 (best)	TRTLLM
8	246.51 (-19.50%)	208.99 (-31.75%)	306.21 (best)	TRTLLM
16	483.16 (-18.65%)	412.14 (-30.60%)	593.90 (best)	TRTLLM
32	963.08 (best)	832.79 (-13.53%)	585.98 (-39.16%)	CuTeDSL
64	2388.24 (best)	1765.84 (-26.06%)	938.92 (-60.69%)	CuTeDSL
128	4138.86 (best)	3577.81 (-13.56%)	3203.88 (-22.59%)	CuTeDSL
256	5331.70 (best)	5207.50 (-2.33%)	3343.82 (-37.28%)	CuTeDSL
512	5786.16 (best)	5264.24 (-9.02%)	3680.79 (-36.39%)	CuTeDSL
1024	6007.93 (best)	5940.19 (-1.13%)	3569.01 (-40.59%)	CuTeDSL
2048	5535.25 (best)	5356.93 (-3.22%)	3455.75 (-37.57%)	CuTeDSL
4096	5276.37 (-1.82%)	5374.20 (best)	3551.15 (-33.92%)	CUTLASS
8192	5346.58 (-4.00%)	5569.13 (best)	3568.93 (-35.92%)	CUTLASS

N=7168, K=18432

M	CuTeDSL	CUTLASS	TRTLLM	Winner
1	17.97 (-11.04%)	16.87 (-16.48%)	20.20 (best)	TRTLLM
2	41.76 (best)	33.07 (-20.81%)	40.28 (-3.53%)	CuTeDSL
4	82.51 (best)	65.98 (-20.02%)	79.96 (-3.08%)	CuTeDSL
8	177.90 (best)	132.71 (-25.40%)	161.40 (-9.28%)	CuTeDSL
16	353.05 (best)	280.15 (-20.65%)	317.09 (-10.19%)	CuTeDSL
32	704.78 (best)	551.51 (-21.75%)	621.58 (-11.80%)	CuTeDSL
64	1368.03 (best)	1074.71 (-21.44%)	779.15 (-43.05%)	CuTeDSL
128	2654.48 (best)	2082.84 (-21.53%)	1739.80 (-34.46%)	CuTeDSL
256	4591.57 (best)	4442.42 (-3.25%)	2904.12 (-36.75%)	CuTeDSL
512	5248.23 (-0.15%)	5255.86 (best)	3039.43 (-42.17%)	CUTLASS
1024	5598.81 (-6.27%)	5973.28 (best)	2949.47 (-50.62%)	CUTLASS
2048	5114.07 (-8.83%)	5609.35 (best)	3206.56 (-42.84%)	CUTLASS
4096	5386.57 (-8.61%)	5893.87 (best)	3284.85 (-44.27%)	CUTLASS
8192	4975.49 (-14.80%)	5839.75 (best)	3264.02 (-44.11%)	CUTLASS
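
One way to read the tables when thinking about a heuristic is to find, per shape, the smallest M from which TRTLLM no longer wins. The sketch below does that from a Winner column. `takeover_m` is a hypothetical helper for analysis only and is not the selection logic implemented in this PR.

```python
from typing import Dict, Optional


def takeover_m(winners_by_m: Dict[int, str], small_m_backend: str = "TRTLLM") -> Optional[int]:
    """Smallest M such that `small_m_backend` wins at no M >= that value.

    Returns None if `small_m_backend` still wins at the largest measured M.
    """
    takeover = None
    for m in sorted(winners_by_m, reverse=True):
        if winners_by_m[m] == small_m_backend:
            break
        takeover = m
    return takeover


# Winner column for SM100, N=18432, K=7168 (from the table above)
ms = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
winners = ["TRTLLM"] * 5 + ["CuTeDSL"] * 7 + ["CUTLASS", "CuTeDSL"]
print(takeover_m(dict(zip(ms, winners))))
```

For the wide N=18432, K=7168 shape the takeover point sits at M=32 on both devices, later than for the other shapes, which suggests that any M threshold may need to depend on (N, K).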

LopezCastroRoberto · 2026-05-18T10:23:53Z

Thanks for the results, @mmangkad. Yeah, I think this makes sense.

Should we also update the FI version to the latest, i.e., 0.6.11.post3?
https://github.com/LopezCastroRoberto/vllm/blob/main/requirements/cuda.txt

LopezCastroRoberto · 2026-05-18T10:29:16Z

cc: @mgoin

mmangkad · 2026-05-18T11:01:22Z

@LopezCastroRoberto we are already at 0.6.11.post2, but I think we can include the upgrade to 0.6.11.post3 here if you prefer.

LopezCastroRoberto · 2026-05-18T11:04:26Z

Never mind, my bad. I accidentally checked my own fork instead of upstream. Waiting for @mgoin's approval.

LopezCastroRoberto · 2026-05-18T11:46:47Z

Worth being aware of, BTW: flashinfer-ai/flashinfer#3295

mmangkad · 2026-05-18T11:48:08Z

Rebased after resolving conflicts caused by #39538 and aligning with its changes

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>

LopezCastroRoberto · 2026-05-28T11:37:42Z

@mmangkad -- following up on the FlashInfer autotuning issue I flagged earlier (flashinfer-ai/flashinfer#3295). The discussion has progressed and there's now a concrete fix, so I wanted to share the conclusions since they directly affect this PR.

Right now vLLM defaults to O2, which has enable_flashinfer_autotune=True (re-enabled in #42857). Once this PR lands making cute-dsl the highest-priority NVFP4 backend, every default NVFP4 deployment will autotune mm_fp4 cuteDSL kernels at startup.

Interestingly, it seems that autotuning mm_fp4 cuteDSL is unnecessary. PR flashinfer-ai/flashinfer#2940 added a heuristic that closes the perf gap between autotuned and non-autotuned runs. The heuristic predicts the best config for each (N, K) combination in <100us on the first call, and subsequent lookups take <0.2us.

To avoid that startup cost, flashinfer-ai/flashinfer#3396 adds a skip_ops mechanism:

with flashinfer.autotune(skip_ops="fp4_gemm"):
    ...

This brought warmup down from 587s to 8s on DSV3.2-NVFP4 TP=4.

I think we should track a follow-up to integrate skip_ops when the FI version is bumped to include #3396, and then merge this PR too.
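
A possible shape for that follow-up is sketched below: pass skip_ops only when the installed autotune entry point accepts it, so the same call site works before and after the version bump. This is an assumption-laden illustration. `autotune_with_optional_skip` is a hypothetical helper, the two context managers are stand-ins rather than FlashInfer code, and it assumes the real entry point can be called with no arguments.

```python
import contextlib
import inspect


def autotune_with_optional_skip(autotune_fn, skip_ops=None):
    """Call `autotune_fn`, passing `skip_ops` only if its signature names that parameter."""
    params = inspect.signature(autotune_fn).parameters
    if skip_ops is not None and "skip_ops" in params:
        return autotune_fn(skip_ops=skip_ops)
    return autotune_fn()


# Stand-ins for an older and a newer autotune entry point (not FlashInfer code).
@contextlib.contextmanager
def old_autotune():
    yield "autotune everything"


@contextlib.contextmanager
def new_autotune(skip_ops=None):
    yield f"autotune, skipping {skip_ops}"


for fn in (old_autotune, new_autotune):
    with autotune_with_optional_skip(fn, skip_ops="fp4_gemm") as mode:
        print(mode)
```

Signature inspection keeps the call site free of version-string parsing, at the cost of not detecting entry points that accept the argument only through `**kwargs`.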

cc: @mgoin

mmangkad requested review from DarkLight1337, WoosukKwon, mgoin, tlrmchlsmth, yewentao256, ywang96 and zyongye as code owners May 10, 2026 15:27

claude Bot reviewed May 10, 2026

mergify Bot added the nvidia label May 10, 2026

github-project-automation Bot added this to NVIDIA May 10, 2026

gemini-code-assist Bot reviewed May 10, 2026

Comment thread vllm/model_executor/kernels/linear/nvfp4/flashinfer.py

Comment thread tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py Outdated

Comment thread tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py

LopezCastroRoberto reviewed May 13, 2026

LopezCastroRoberto approved these changes May 18, 2026

mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch 2 times, most recently from 3b67dbc to d2c176d Compare May 18, 2026 11:41

mmangkad requested review from ProExpertProg, hmellor, houseroad, robertgshaw2-redhat and youkaichao as code owners May 18, 2026 11:41

mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch from d2c176d to 3700b17 Compare May 18, 2026 11:44

meena-at-work mentioned this pull request May 20, 2026

[CI/Build] Bump flashinfer to v0.6.11.post3 #43251

Open

qiching mentioned this pull request May 22, 2026

The autotune speed of mm_fp4 with backend=cute-dsl is slow flashinfer-ai/flashinfer#3295

Closed

Add FlashInfer cutedsl NVFP4 GEMM backend

c34ce76

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>

mmangkad force-pushed the add-flashinfer-cutedsl-nvfp4 branch from 3700b17 to c34ce76 Compare May 23, 2026 04:31
