Support Flashinfer Cute-DSL MLA attention #24737
Open
b8zhong wants to merge 5 commits into
a427013 to
88791de
Compare
Collaborator
cc @leejnau
leejnau
reviewed
May 11, 2026
27abb6f to
e4ab12b
Compare
leejnau
reviewed
May 11, 2026
leejnau
approved these changes
May 12, 2026
Collaborator
Just FYI, more FlashInfer optimizations for cute-dsl MLA decode: flashinfer-ai/flashinfer#3309
Request to autotune between trtllm-gen and cutedsl MLA in FlashInfer: flashinfer-ai/flashinfer#2891
Qiaolin-Yu
approved these changes
May 14, 2026
bffedb1 to
c33e2de
Compare
Motivation
@nvpohanh
Ref:
flashinfer-ai/flashinfer#2805
flashinfer-ai/flashinfer#2743
(closed, could need new/reopened PR) Flashinfer autotune PR: flashinfer-ai/flashinfer#3086
Modifications
Add this as a new backend, for the purposes of debugging and easily switching implementations for now. Ideally, in the future, the `trtllm_mla` backend will still be allowed to be autotuned to use the cute-dsl implementation, as we don't pass in the explicit backend string.

The small int8 change is because, for some reason, the `cute-dsl` MLA backend doesn't support unsigned. Changed it for simplicity, since it makes no real difference.

Kernel limitations:
Head dim: only DSR1 with DP attention is supported, and only decode is supported (Kimi K2 dim through padding here: flashinfer-ai/flashinfer#3161)
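To make the backend selection and kernel limitations above concrete, here is a minimal sketch of how an explicit backend string versus an open (autotuned) choice could be resolved. All names here (`pick_mla_impl`, `TRTLLM_GEN`, `CUTE_DSL`, `head_dim_supported`) are hypothetical and are not the identifiers used in this PR or in FlashInfer. The fallback to trtllm-gen for unsupported cases is an assumption, and the "autotune" branch is only a placeholder for real timing-based selection.

```python
from typing import Optional

# Hypothetical labels for illustration only; not the PR's identifiers.
TRTLLM_GEN = "trtllm-gen"
CUTE_DSL = "cute-dsl"


def pick_mla_impl(
    requested: Optional[str],
    is_decode: bool,
    head_dim_supported: bool,
) -> str:
    """Resolve which MLA implementation to run.

    requested: explicit backend string, or None to leave the choice open
    (standing in for the autotuned path).
    """
    # Kernel limitations from the PR description: decode only, and only
    # the supported head dims.
    cute_dsl_ok = is_decode and head_dim_supported

    if requested == CUTE_DSL:
        if not cute_dsl_ok:
            raise ValueError("cute-dsl MLA: decode only, limited head dims")
        return CUTE_DSL
    if requested == TRTLLM_GEN:
        return TRTLLM_GEN
    if requested is None:
        # Placeholder for autotuning: a real autotuner would time both
        # candidates. Assumption: prefer cute-dsl whenever it is eligible.
        return CUTE_DSL if cute_dsl_ok else TRTLLM_GEN
    raise ValueError(f"unknown backend: {requested!r}")
```

Passing an explicit string pins the implementation, which is convenient for debugging and A/B profiling. Passing `None` leaves room for an autotuner to choose between the two.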
Accuracy Tests
Before:
After:
After in TP mode:
Looks fine.
Speed Tests and Profiling
On SM103
Before:

After:

Attention speedup: around 18%
For TP mode:
Before:

After:

CI States
Latest PR Test (Base): ✅ Run #26258690377
Latest PR Test (Extra): ❌ Run #26258690334