perf: prefer cuDNN first for mm_fp4 on CUDA>=13 and cuDNN>=9.15 (SM100/SM103) #2664
Open
mmangkad wants to merge 1 commit into flashinfer-ai:main from mmangkad:perf/mm-fp4-cudnn-priority-cuda13
+4 −12
Heuristic lacks SM100/SM103 guard despite documented scope.
The _heuristic_func_mm_fp4 implementation at line 3646 prioritizes cuDNN whenever CUDA>=13 and cuDNN>=9.15, without checking the architecture. The docstring promises this behavior only for SM100/SM103, but the code applies it to all supported architectures (SM110, SM120, and SM121 also have suitable backends). This can unintentionally shift backend preferences on architectures where it wasn't intended. The fix should add an SM version check using the existing _match_sm_version() utility.
Suggested fix
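A minimal sketch of the guarded ordering the review asks for. The helper and function names below (order_mm_fp4_backends, and the stand-in for flashinfer's _match_sm_version) are hypothetical; the real signatures in the repository may differ, and the version/architecture encodings are assumptions for illustration.

```python
# Hypothetical sketch of an SM-guarded backend-ordering heuristic.
# _match_sm_version here is a stand-in for flashinfer's utility of the
# same name; its real signature may differ.

def _match_sm_version(sm_arch: str, allowed: list) -> bool:
    """Stand-in: True if the device's SM architecture is in the allowed set."""
    return sm_arch in allowed


def order_mm_fp4_backends(sm_arch, cuda_version, cudnn_version, backends):
    # Prefer cuDNN first only on the benchmarked architectures (SM100/SM103)
    # when CUDA >= 13.0 and cuDNN >= 9.15; otherwise keep the existing order
    # (e.g. CUTLASS-first) untouched for SM110/SM120/SM121.
    if (
        cuda_version >= (13, 0)
        and cudnn_version >= (9, 15)
        and _match_sm_version(sm_arch, ["100", "103"])
    ):
        return ["cudnn"] + [b for b in backends if b != "cudnn"]
    return backends


# SM100 on a new toolchain: cuDNN moves to the front.
print(order_mm_fp4_backends("100", (13, 0), (9, 15), ["cutlass", "cudnn", "trtllm"]))
# SM120 keeps its existing ordering unchanged.
print(order_mm_fp4_backends("120", (13, 0), (9, 15), ["cutlass", "cudnn", "trtllm"]))
```

The guard keeps the change scoped to the architectures the PR actually benchmarked, which is the conservative behavior the review recommends.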
I think this is right, but I basically reverted to the older heuristic, which back then applied the cuDNN-first priority to SM110/SM120/SM121 as well. I'm not sure now whether that is also best for those architectures.
@mmangkad You're right to be cautious about SM110/SM120/SM121. Looking at the PR description, the benchmark data only covers SM100 and SM103; there is no performance comparison for SM110/SM120/SM121 with the cuDNN-first heuristic. My recommendation: add the SM100/SM103 guard as suggested. Here's why:
Align with tested scope: the PR's rationale ("cuDNN is reported to be better in most cases on SM103") and benchmark plots only validate SM100 and SM103. Applying the change to SM110/SM120/SM121 without data extends beyond the tested scope.
Conservative approach: without benchmarks, we can't be sure cuDNN-first is optimal for SM110/SM120/SM121. The guard keeps the behavior for those architectures unchanged until validated.
Separate decision: if you want cuDNN-first for SM110/SM120/SM121, that should be a separate PR with its own benchmark data.
Trade-off: the current code (without the guard) applies a potentially unverified optimization to SM110/SM120/SM121. The guard (conservative) keeps them on the existing CUTLASS-first heuristic until you have data showing cuDNN is better.
Would you like to: