[Fix] Upgrade FlashInfer to v0.6.4+ to Resolve SM90 Performance Regression #99
Closed
cswuyg wants to merge 1 commit into sgl-project:main from
Conversation
Collaborator
Thanks! We already fixed the attention backend to fa2 long ago to avoid this regression. See mini-sglang/python/minisgl/attention/fi.py, lines 93 to 103 and line 253 (at commit c7f800d).
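The workaround the collaborator describes can be sketched as follows. This is a hypothetical illustration of pinning the backend, not the actual code in fi.py; the helper name select_backend is invented for this sketch:

```python
# Illustrative sketch: pin the attention backend to FA2 on SM90 (Hopper)
# so that FlashInfer's "auto" dispatch cannot select the slow FA3 path.
# NOT the actual mini-sglang code in fi.py.
def select_backend(requested: str = "auto", sm_major: int = 9) -> str:
    """Return the FlashInfer backend to use for a given GPU generation."""
    if requested == "auto" and sm_major == 9:
        # Work around the pre-v0.6.4 regression: force FA2 on Hopper.
        return "fa2"
    return requested
```

With this guard, "auto" resolves to "fa2" on SM90, while an explicitly requested backend is passed through unchanged.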
🚀 Problem Description
On SM90 (Hopper) architectures, FlashInfer versions prior to v0.6.4 exhibit a significant performance regression within the
CUDAGraphBatchDecodeWithPagedKVCacheWrapper: the auto backend incorrectly defaults to FA3 (FlashAttention-3), whose PrefillWithKVCacheKernel performs approximately 5x slower than the BatchPrefillWithPagedKVCacheKernel (FA2).

🛠️ Proposed Solution
This PR pins the FlashInfer dependency to >= v0.6.4.
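For reference, the pin could be expressed as a version specifier in the project's dependency list. The file location and the PyPI package name (flashinfer-python) are assumptions, not taken from this PR:

```
# requirements.txt (illustrative location)
flashinfer-python>=0.6.4
```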
Two solutions were considered:
1. Replace CUDAGraphBatchDecodeWithPagedKVCacheWrapper with BatchDecodeWithPagedKVCacheWrapper and manually specify backend="fa2".
2. Upgrade FlashInfer to v0.6.4+, which fixes the auto dispatch logic in newer releases.

📊 Benchmark Comparison (Decode Phase - Long Sequence)
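Option 2 could also be enforced at runtime with a version guard. A minimal stdlib-only sketch, assuming simple "major.minor.patch" version strings (FlashInfer itself is not imported here, and both helper names are hypothetical):

```python
# Illustrative runtime guard for option 2: detect FlashInfer builds that
# predate the auto-dispatch fix. Helper names are hypothetical.
def parse_version(v: str) -> tuple:
    """Parse 'v0.6.4' or '0.6.4' into a comparable (0, 6, 4) tuple."""
    return tuple(int(p) for p in v.lstrip("v").split(".")[:3])

# First release with the corrected auto -> FA2 dispatch on SM90.
MIN_FLASHINFER = parse_version("v0.6.4")

def needs_fa2_workaround(installed: str) -> bool:
    """True if the installed FlashInfer still has the SM90 FA3 regression."""
    return parse_version(installed) < MIN_FLASHINFER
```

A caller could use needs_fa2_workaround to decide between forcing backend="fa2" (option 1) and trusting auto dispatch.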
[Benchmark table comparing PrefillWithKVCacheKernel (FA3) and BatchPrefillWithPagedKVCacheKernel (FA2); the figures were not recoverable from the page extraction.]

🔗 FlashInfer Fix Reference