[WIP][Perf] Add FlashInfer CuTeDSL backend for NVFP4 GEMM on Blackwell #39933

Draft
LopezCastroRoberto wants to merge 1 commit into vllm-project:main from LopezCastroRoberto:perf/fp4_cute-dsl

@LopezCastroRoberto
Contributor

nvfp4_b300_n10240_k8192

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates flashinfer-python to version 0.6.7 and introduces the cute-dsl backend for NVFP4 quantization. The changes include updates to the backend enum, weight preparation logic, and kernel tests. A critical feedback point notes that bypassing the flashinfer_mm_fp4 custom operator in favor of a direct library call will likely break CUDA graph capture; it is recommended to update the custom operator to accept the new parameters instead.

Comment thread vllm/utils/flashinfer.py
Comment on lines +577 to 590
    from flashinfer import mm_fp4 as _flashinfer_mm_fp4

    return _flashinfer_mm_fp4(
        a,
        b.t(),
        block_scale_a,
        block_scale_b.t(),
        alpha,
        out_dtype,
        block_size=16,
        use_8x4_sf_layout=use_8x4_sf_layout,
        backend=backend,
        use_nvfp4=True,
    )
Contributor

critical

Bypassing the vllm::flashinfer_mm_fp4 custom op by calling flashinfer.mm_fp4 directly will likely break CUDA graph capture, which is a key performance feature in vLLM. This can lead to performance regressions.

Instead of bypassing the custom op, please update its definition (and its fake implementation) to accept the use_nvfp4 parameter and pass it to the underlying flashinfer.mm_fp4 call. The custom op is defined in this same file, so it should be straightforward to modify.

After updating the custom op, you can call it from here like this:

    return flashinfer_mm_fp4(
        a,
        b.t(),
        block_scale_a,
        block_scale_b.t(),
        alpha,
        out_dtype,
        use_8x4_sf_layout=use_8x4_sf_layout,
        backend=backend,
        use_nvfp4=True,
    )
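
For context, a fake implementation of a custom op only has to report the shape of the result without launching a kernel, which is what lets tracing and graph capture proceed. The pure-Python sketch below illustrates that shape logic under stated assumptions: `a` is packed FP4 with shape (M, K // 2), `b.t()` has shape (K // 2, N), and one scale factor covers 16 elements as in `block_size=16`. The helper names `fake_mm_fp4_shape` and `scales_per_row` are hypothetical and are not part of vLLM or FlashInfer; the K and N values are read off the benchmark label in the PR description.

```python
from typing import Tuple

BLOCK_SIZE = 16   # elements covered by one scale factor (block_size=16 in the diff)
FP4_PER_BYTE = 2  # assumption: two 4-bit values are packed into each byte


def fake_mm_fp4_shape(
    a_shape: Tuple[int, int], b_t_shape: Tuple[int, int]
) -> Tuple[int, int]:
    """Hypothetical helper: the output shape a fake implementation would report.

    a_shape is assumed to be (M, K // 2) and b_t_shape (K // 2, N).
    """
    m, k_packed_a = a_shape
    k_packed_b, n = b_t_shape
    if k_packed_a != k_packed_b:
        raise ValueError(f"packed K mismatch: {k_packed_a} vs {k_packed_b}")
    return (m, n)


def scales_per_row(k: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of block scale factors needed for one row of K elements."""
    if k % block_size != 0:
        raise ValueError(f"K={k} is not a multiple of block_size={block_size}")
    return k // block_size


# K and N come from the benchmark label; M is an arbitrary batch dimension.
m, n, k = 128, 10240, 8192
out_shape = fake_mm_fp4_shape((m, k // FP4_PER_BYTE), (k // FP4_PER_BYTE, n))
print(out_shape, scales_per_row(k))
```

In the real op, the fake implementation would return an empty tensor of this shape with `out_dtype`; adding `use_nvfp4` to the op signature should not change the shape logic.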

@mergify
Contributor

mergify Bot commented Apr 15, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LopezCastroRoberto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 15, 2026
@LopezCastroRoberto LopezCastroRoberto changed the title from "[Perf] Add FlashInfer CuTeDSL backend for NVFP4 GEMM on Blackwell" to "[WIP][Perf] Add FlashInfer CuTeDSL backend for NVFP4 GEMM on Blackwell" Apr 17, 2026
@mergify mergify Bot removed the needs-rebase label May 18, 2026
@mergify
Contributor

mergify Bot commented May 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LopezCastroRoberto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 18, 2026