[GDN] Remove FlashInfer GDN decode + no_buffer guard and default to FlashInfer on SM100+ #21861
Open
YAMY1234 wants to merge 2 commits into sgl-project:main from
Conversation
The root cause (OOB memory access from negative padding indices in the bf16 decode kernel) was fixed in FlashInfer v0.6.7 via flashinfer-ai/flashinfer#2810. Verified on Qwen3.5-397B-A17B-NVFP4 (4×GB200, no_buffer + disable-radix-cache + --linear-attn-decode-backend flashinfer):
- GSM8K accuracy: 0.977-0.979 across conc=128/512
- sa-bench TPOT improvement: 1-5% vs baseline (no ladfi)

Closes sgl-project#20791
On SM100+ with mamba-ssm-dtype=bfloat16, automatically set --linear-attn-decode-backend to flashinfer when not explicitly specified. This gives a 1-5% TPOT improvement at higher concurrencies. The prerequisite bug (OOB from negative padding indices in the bf16 decode kernel) was fixed in FlashInfer v0.6.7 via flashinfer-ai/flashinfer#2810. Excluded when MTP speculative decoding is active (not yet supported). Verified on Qwen3.5-397B-A17B-NVFP4 (4×GB200, no_buffer + disable-radix-cache), sa-bench ISL=1024 OSL=1024, conc 2-1024:
- GSM8K accuracy: 0.977-0.979
- Mean TPOT: -1.3% (conc=2) to -4.5% (conc=1024)
- Output throughput: +1.3% (conc=2) to +4.7% (conc=1024)
Contributor (Author):
/tag-and-rerun-ci
Motivation
The FlashInfer GDN decode kernel was previously blocked from being used with --mamba-scheduler-strategy no_buffer due to accuracy degradation caused by OOB memory access from negative padding indices in the bf16 decode kernel (see #20791). The root cause was fixed in FlashInfer v0.6.7 via flashinfer-ai/flashinfer#2810 (padding index guard for the bf16 decode kernel). Thanks to @kaixih!
With PR #21422 merged (upgrading FlashInfer to v0.6.7), we are able to remove this guard and proceed with further benchmarking.
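The resulting default-selection behavior can be sketched roughly as follows. This is a minimal illustration only: the function and argument names are hypothetical and do not correspond to actual sglang code; the real implementation lives in sglang's server-argument handling.

```python
def resolve_linear_attn_decode_backend(
    explicit_backend,          # value of --linear-attn-decode-backend, or None
    device_capability,         # (major, minor) CUDA compute capability
    mamba_ssm_dtype,           # e.g. "bfloat16"
    mtp_speculative_decoding,  # True when MTP speculative decoding is active
):
    """Hypothetical sketch of the backend default after this PR."""
    # An explicitly requested backend always wins; the old ValueError guard
    # that rejected flashinfer under no_buffer is removed.
    if explicit_backend is not None:
        return explicit_backend
    # New default: FlashInfer on SM100+ with bf16 SSM state, except when MTP
    # speculative decoding is active (FlashInfer GDN MTP verify is not yet
    # supported on SM100+).
    if (
        device_capability >= (10, 0)
        and mamba_ssm_dtype == "bfloat16"
        and not mtp_speculative_decoding
    ):
        return "flashinfer"
    return "triton"
```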
Modifications
- Removed the raise ValueError guard that blocked --linear-attn-decode-backend flashinfer with --mamba-scheduler-strategy no_buffer.
- Default --linear-attn-decode-backend to flashinfer on SM100+ when mamba-ssm-dtype=bfloat16 and no explicit decode backend is specified. This is excluded when MTP speculative decoding is active, since FlashInfer GDN MTP verify is not yet supported on SM100+.

Benchmarking
Accuracy Validation
Verified on Qwen3.5-397B-A17B (4×GB200), GSM8K (1319 examples, 8-shot, temp=0.6):
Performance
sa-bench, ISL=1024, OSL=1024, NVFP4, 4×GB200
Comparison: --linear-attn-decode-backend flashinfer vs default (triton).

TPOT improves across all concurrency levels, with gains increasing at higher concurrencies, reaching up to a 4.5% TPOT reduction and a 4.7% throughput improvement at conc=1024.

Closes #20791
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci