[WIP][Perf] Add FlashInfer CuTeDSL backend for NVFP4 GEMM on Blackwell#39933
LopezCastroRoberto wants to merge 1 commit into
LopezCastroRoberto commented on Apr 15, 2026
Code Review
This pull request updates flashinfer-python to version 0.6.7 and introduces the cute-dsl backend for NVFP4 quantization. The changes include updates to the backend enum, weight preparation logic, and kernel tests. A critical feedback point notes that bypassing the flashinfer_mm_fp4 custom operator in favor of a direct library call will likely break CUDA graph capture; it is recommended to update the custom operator to accept the new parameters instead.
from flashinfer import mm_fp4 as _flashinfer_mm_fp4

return _flashinfer_mm_fp4(
    a,
    b.t(),
    block_scale_a,
    block_scale_b.t(),
    alpha,
    out_dtype,
    block_size=16,
    use_8x4_sf_layout=use_8x4_sf_layout,
    backend=backend,
    use_nvfp4=True,
)
Bypassing the vllm::flashinfer_mm_fp4 custom op by calling flashinfer.mm_fp4 directly will likely break CUDA graph capture, which is a key performance feature in vLLM. This can lead to performance regressions.
Instead of bypassing the custom op, please update its definition (and its fake implementation) to accept the use_nvfp4 parameter and pass it to the underlying flashinfer.mm_fp4 call. The custom op is defined in this same file, so it should be straightforward to modify.
After updating the custom op, you can call it from here like this:
return flashinfer_mm_fp4(
    a,
    b.t(),
    block_scale_a,
    block_scale_b.t(),
    alpha,
    out_dtype,
    use_8x4_sf_layout=use_8x4_sf_layout,
    backend=backend,
    use_nvfp4=True,
)
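When a parameter such as use_nvfp4 is added to a custom op, its fake implementation has to gain the same parameter, or the two drift apart. A minimal sketch of a parity check follows, using only the standard library. The three functions are hypothetical stand-ins rather than the real vLLM or FlashInfer definitions, and the default values shown are assumptions made for illustration.

```python
import inspect


def signatures_match(real_impl, fake_impl) -> bool:
    """Return True when both callables declare the same parameters
    (name, kind, default) in the same order."""
    def describe(fn):
        params = inspect.signature(fn).parameters.values()
        return [(p.name, p.kind, p.default) for p in params]

    return describe(real_impl) == describe(fake_impl)


# Hypothetical stand-in for the real op body (defaults are assumed).
def mm_fp4_real(a, b, block_scale_a, block_scale_b, alpha, out_dtype,
                use_8x4_sf_layout=False, backend="cutlass", use_nvfp4=True):
    raise NotImplementedError("stand-in only")


# Hypothetical fake implementation, updated alongside the real one.
def mm_fp4_fake(a, b, block_scale_a, block_scale_b, alpha, out_dtype,
                use_8x4_sf_layout=False, backend="cutlass", use_nvfp4=True):
    raise NotImplementedError("stand-in only")


# Hypothetical stale fake implementation that was not given use_nvfp4.
def mm_fp4_fake_stale(a, b, block_scale_a, block_scale_b, alpha, out_dtype,
                      use_8x4_sf_layout=False, backend="cutlass"):
    raise NotImplementedError("stand-in only")


if __name__ == "__main__":
    print(signatures_match(mm_fp4_real, mm_fp4_fake))
    print(signatures_match(mm_fp4_real, mm_fp4_fake_stale))
```

A check like this could sit in a unit test next to the op definition, so that a signature change applied to only one of the two implementations fails early.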
This pull request has merge conflicts that must be resolved before it can be merged.