
[GDN] Remove FlashInfer GDN decode + no_buffer guard and default to FlashInfer on SM100+ #21861

Open
YAMY1234 wants to merge 2 commits into sgl-project:main from YAMY1234:remove-gdn-no-buffer-guard

Conversation

Contributor

@YAMY1234 YAMY1234 commented Apr 1, 2026

Motivation

The FlashInfer GDN decode kernel was previously blocked from being used with `--mamba-scheduler-strategy no_buffer` due to accuracy degradation caused by OOB memory access from negative padding indices in the bf16 decode kernel (see #20791).

The root cause was fixed in FlashInfer v0.6.7 via flashinfer-ai/flashinfer#2810 (padding index guard for bf16 decode kernel). Thanks to @kaixih!

With PR #21422 merged (upgrading FlashInfer to v0.6.7), we can now remove this guard and proceed with benchmarking.

Modifications

  1. Remove the `raise ValueError` guard that blocked `--linear-attn-decode-backend flashinfer` with `--mamba-scheduler-strategy no_buffer`.
  2. Default to FlashInfer GDN decode on SM100+ when `mamba-ssm-dtype=bfloat16` and no decode backend is explicitly specified. The default is not applied when MTP speculative decoding is active, since FlashInfer GDN MTP verify is not yet supported on SM100+.
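The selection logic in item 2 can be sketched roughly as follows. This is an illustrative sketch only; the function and parameter names here are hypothetical and do not correspond to the actual SGLang code:

```python
def pick_linear_attn_decode_backend(
    decode_backend,      # value of --linear-attn-decode-backend, or None
    device_sm,           # CUDA SM version, e.g. 100 for GB200 (Blackwell)
    mamba_ssm_dtype,     # value of mamba-ssm-dtype
    mtp_spec_decoding,   # True if MTP speculative decoding is active
):
    """Hypothetical sketch of the new default-backend selection.

    - An explicitly requested backend always wins.
    - On SM100+ with bf16 SSM state and no MTP speculative decoding,
      default to the FlashInfer GDN decode kernel.
    - Otherwise fall back to the Triton kernel (the previous default).
    """
    if decode_backend is not None:
        return decode_backend
    if device_sm >= 100 and mamba_ssm_dtype == "bfloat16" and not mtp_spec_decoding:
        return "flashinfer"
    return "triton"
```

Under this sketch, an explicit `--linear-attn-decode-backend triton` still forces the Triton kernel, so the behavior change only affects users who did not pass the flag.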

Benchmarking

Accuracy Validation

Verified on Qwen3.5-397B-A17B (4×GB200), GSM8K (1319 examples, 8-shot, temp=0.6):

| Scenario | Quantization | Scheduler | MTP | conc=128 | conc=512 |
|---|---|---|---|---|---|
| NVFP4 + no_buffer | modelopt_fp4 | no_buffer | - | 0.977 | 0.977 |
| NVFP4 + extra_buffer | modelopt_fp4 | extra_buffer | - | 0.979 | 0.977 |
| FP8 + no_buffer | fp8 | no_buffer | - | 0.978 | 0.977 |
| FP8 + extra_buffer | fp8 | extra_buffer | - | 0.979 | 0.979 |
| NVFP4 + MTP | modelopt_fp4 | extra_buffer | NEXTN | N/A | N/A |

MTP is excluded because FlashInfer GDN MTP verify is not yet supported on SM100+.

Performance

sa-bench, ISL=1024, OSL=1024, NVFP4, 4×GB200
Comparison: `--linear-attn-decode-backend flashinfer` vs. default (triton)

| Concurrency | Baseline Mean TPOT | FlashInfer Mean TPOT | TPOT Difference | Output tok/s Improvement |
|---|---|---|---|---|
| 2 | 5.68 ms | 5.61 ms | -1.3% | +1.3% |
| 4 | 6.43 ms | 6.35 ms | -1.2% | +0.5% |
| 8 | 7.33 ms | 7.29 ms | -0.6% | +1.2% |
| 16 | 8.58 ms | 8.54 ms | -0.5% | +0.5% |
| 32 | 10.86 ms | 10.77 ms | -0.8% | +0.7% |
| 64 | 13.61 ms | 13.43 ms | -1.3% | +1.6% |
| 128 | 17.36 ms | 17.16 ms | -1.2% | +1.2% |
| 256 | 23.60 ms | 23.04 ms | -2.4% | +2.3% |
| 512 | 33.70 ms | 32.47 ms | -3.7% | +3.6% |
| 1024 | 53.50 ms | 51.08 ms | -4.5% | +4.7% |

TPOT improves across all concurrency levels, with gains increasing at higher concurrencies, reaching up to 4.5% TPOT reduction and 4.7% throughput improvement at conc=1024.
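The percentage columns follow directly from the raw TPOT values; as a sanity check on the conc=1024 row (variable names here are illustrative):

```python
# conc=1024 row of the table (mean TPOT in ms)
baseline_tpot = 53.50      # default (triton) backend
flashinfer_tpot = 51.08    # --linear-attn-decode-backend flashinfer

# Relative change in time-per-output-token
tpot_diff_pct = (flashinfer_tpot - baseline_tpot) / baseline_tpot * 100

# Since decode is per-token sequential, token throughput scales roughly
# with the inverse of TPOT
throughput_gain_pct = (baseline_tpot / flashinfer_tpot - 1) * 100

print(f"TPOT difference: {tpot_diff_pct:.1f}%")         # -4.5%
print(f"Throughput gain: +{throughput_gain_pct:.1f}%")  # +4.7%
```

This also explains why the throughput gain slightly exceeds the TPOT reduction in magnitude at each concurrency level.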

Closes #20791

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

YAMY1234 added 2 commits April 1, 2026 00:25
The root cause (OOB memory access from negative padding indices in
bf16 decode kernel) was fixed in FlashInfer v0.6.7 via
flashinfer-ai/flashinfer#2810.

Verified on Qwen3.5-397B-A17B-NVFP4 (4xGB200, no_buffer +
disable-radix-cache + --linear-attn-decode-backend flashinfer):
  - GSM8K accuracy: 0.977-0.979 across conc=128/512
  - sa-bench TPOT improvement: 1-5% vs baseline (no ladfi)

Closes sgl-project#20791
On SM100+ with mamba-ssm-dtype=bfloat16, automatically set
--linear-attn-decode-backend to flashinfer when not explicitly
specified. This gives 1-5% TPOT improvement at higher concurrencies.

The prerequisite bug (OOB from negative padding indices in bf16
decode kernel) was fixed in FlashInfer v0.6.7 via
flashinfer-ai/flashinfer#2810.

Verified on Qwen3.5-397B-A17B-NVFP4 (4xGB200, no_buffer +
disable-radix-cache), sa-bench ISL=1024 OSL=1024, conc 2-1024:
  - GSM8K accuracy: 0.977-0.979
  - Mean TPOT: -1.3% (conc=2) to -4.5% (conc=1024)
  - Output throughput: +1.3% (conc=2) to +4.7% (conc=1024)
  - Excluded when MTP speculative decoding is active (not yet supported)
Contributor Author

YAMY1234 commented Apr 1, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Apr 1, 2026

Development

Successfully merging this pull request may close these issues.

[Bug] [GDN] Accuracy degradation with flashinfer gated_delta_rule_decode_pretranspose under no_buffer scheduling
