
[Fix] Upgrade FlashInfer to v0.6.4+ to Resolve SM90 Performance Regression #99

Closed
cswuyg wants to merge 1 commit into sgl-project:main from cswuyg:feature/cswuyg_flashinfer_version2

Conversation

@cswuyg
Contributor

@cswuyg cswuyg commented Mar 9, 2026

🚀 Problem Description

On SM90 (Hopper) architectures, FlashInfer versions prior to v0.6.4 exhibit a significant performance regression within the CUDAGraphBatchDecodeWithPagedKVCacheWrapper.

  • Root Cause: In versions prior to v0.6.4, the auto backend incorrectly dispatches to FA3 (FlashAttention-3) on SM90.
  • Performance Impact: Decode-phase benchmarks on long sequences show the FA3 PrefillWithKVCacheKernel running approximately 5x slower than the FA2 BatchPrefillWithPagedKVCacheKernel.

🛠️ Proposed Solution

We pin the FlashInfer dependency to v0.6.4 or newer.

Two solutions were considered:

  1. Manual Workaround (Rejected): Replace CUDAGraphBatchDecodeWithPagedKVCacheWrapper with BatchDecodeWithPagedKVCacheWrapper and manually specify backend="fa2".
  2. Version Upgrade (Selected): Enforce a version bump to leverage the fixed auto dispatch logic in newer releases.
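The selected option can also be enforced at runtime with a fail-fast version guard. The sketch below is illustrative only (the helper names and the simplistic dot-separated version parsing are assumptions, not code from this PR):

```python
# Sketch: refuse to run against a FlashInfer build that predates the
# SM90 auto-dispatch fix. Version parsing here handles plain numeric
# "X.Y.Z" strings only, for illustration.
MIN_FLASHINFER = (0, 6, 4)

def parse_version(version: str) -> tuple:
    """Turn a version string like '0.6.4' into (0, 6, 4) for tuple comparison."""
    return tuple(int(part) for part in version.split(".")[:3])

def has_fa2_dispatch_fix(installed: str) -> bool:
    """True if the installed FlashInfer includes the fixed auto-dispatch logic."""
    return parse_version(installed) >= MIN_FLASHINFER
```

In practice the same constraint would live in the package metadata (e.g. `flashinfer>=0.6.4`), with a guard like this only as a defensive check.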

📊 Benchmark Comparison (Decode Phase - Long Sequence)

| Backend | Kernel | Performance |
| --- | --- | --- |
| FA3 (Old Auto) | PrefillWithKVCacheKernel | ~5x Latency ❌ |
| FA2 (New Auto) | BatchPrefillWithPagedKVCacheKernel | Baseline |
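A comparison like the one above is typically produced with a warmup-then-average timing loop. The helper below is a dependency-free sketch (for real CUDA kernels you would synchronize the device around the timed region, e.g. with `torch.cuda.synchronize()`):

```python
import time

def time_kernel(fn, iters: int = 50, warmup: int = 5) -> float:
    """Average wall-clock seconds per call of fn(), after warmup iterations.

    Warmup runs absorb one-time costs (JIT, caches, graph capture) so the
    averaged measurement reflects steady-state latency.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters
```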

🔗 FlashInfer Fix Reference

@DarkSharpness
Collaborator

Thanks! We already pinned the attention backend to fa2 a while ago to avoid this regression:

```python
self.prefill_wrapper = BatchPrefillWithPagedKVCacheWrapper(
    self.float_workspace_buffer,
    kv_layout="NHD",
    backend="fa2",  # flashinfer fa3 is slow, use fa2 instead
)
self.decode_wrappers = BatchDecodeWithPagedKVCacheWrapper(
    self.float_workspace_buffer,
    use_tensor_cores=self.use_tensor_cores,
    kv_layout="NHD",
    backend="fa2",  # flashinfer fa3 is slow, use fa2 instead
)
```

```python
self.graph_wrappers[bs]._backend = "fa2"
```
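The hard-coded `backend="fa2"` above amounts to a dispatch rule keyed on GPU architecture and FlashInfer version. A minimal standalone sketch of that rule (the function name, signature, and thresholds are assumptions for illustration, not sglang's actual logic):

```python
def pick_attention_backend(sm_major: int, sm_minor: int,
                           flashinfer_fixed: bool) -> str:
    """Choose the FlashInfer attention backend.

    On SM90 (Hopper), FlashInfer releases before v0.6.4 mis-dispatch
    'auto' to the slow FA3 prefill kernel, so force fa2 there unless a
    fixed release is installed.
    """
    if (sm_major, sm_minor) >= (9, 0) and not flashinfer_fixed:
        return "fa2"  # sidestep the FA3 regression described in this PR
    return "auto"     # fixed FlashInfer can be trusted to dispatch
```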

@cswuyg cswuyg closed this Mar 10, 2026
