[NVIDIA] [GDN] Enable FlashInfer MTP verify on SM100+ (Blackwell)#23273
Open
wenscarl wants to merge 1 commit into
mmangkad added a commit to mmangkad-dev/sglang that referenced this pull request on Apr 28, 2026:
…fy on SM100+ (Blackwell)
Resolved conflicts with PR sgl-project#22921:
- gdn_flashinfer.py: combined module and class docstrings to reflect that SM100+ now supports decode, prefill, and MTP verify.
- gdn_flashinfer.py target_verify: dropped the SM100+ NotImplementedError guard so the pool-API MTP path runs on both SM90 and SM100+.
- server_args.py: kept the bf16 gate from sgl-project#22921 and removed the speculative_algorithm gate now that MTP verify is supported on SM100+.
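The server_args.py item in the commit message above (keep the bf16 gate, drop the speculative_algorithm gate) can be pictured with a minimal sketch. This is not the actual server_args.py code: `GateInputs`, `default_backend_before`, and `default_backend_after` are hypothetical names, and returning `"triton"` as the fallback is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateInputs:
    """Hypothetical stand-in for the few server arguments the gate reads."""
    sm_major: int                           # 9 = SM90 (Hopper), 10 = SM100 (Blackwell)
    mamba_ssm_dtype: str                    # e.g. "bfloat16"
    speculative_algorithm: Optional[str] = None  # e.g. "NEXTN" when MTP is enabled


def default_backend_before(a: GateInputs) -> str:
    # Before this PR: FlashInfer was auto-selected on SM100+ only when the
    # state dtype was bf16 AND no speculative algorithm was configured.
    if a.sm_major >= 10 and a.mamba_ssm_dtype == "bfloat16" and a.speculative_algorithm is None:
        return "flashinfer"
    return "triton"  # assumed fallback, for illustration only


def default_backend_after(a: GateInputs) -> str:
    # After this PR: the bf16 gate stays, the speculative_algorithm gate is gone.
    if a.sm_major >= 10 and a.mamba_ssm_dtype == "bfloat16":
        return "flashinfer"
    return "triton"  # assumed fallback, for illustration only
```

Under these assumptions, a Blackwell run with bf16 state and `--speculative-algorithm NEXTN` would move from the fallback to FlashInfer, while a non-bf16 run would be unaffected.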
mmangkad added a commit to mmangkad-dev/sglang that referenced this pull request on Apr 28, 2026:
PR sgl-project#22921 renamed the SM-gating attribute from use_state_pool to is_sm100plus (updating all existing call sites). PR sgl-project#23273 was authored against the older name and added a new reference in the bf16 MTP adapter setup. The git auto-merge picked up sgl-project#22921's renames and sgl-project#23273's new block, leaving a single dangling use_state_pool access that crashed at FlashInferGDNKernel.__init__. Rename the one remaining reference to is_sm100plus to match the rest of the class.
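The failure mode described in this commit message, where one reference to a renamed attribute survives a clean auto-merge, can be reproduced with a toy class. This is a sketch only: `BrokenKernel`, `FixedKernel`, and `mtp_on_blackwell` are hypothetical names and do not mirror the real FlashInferGDNKernel body; only the attribute names `use_state_pool` and `is_sm100plus` come from the message above.

```python
class BrokenKernel:
    """Toy model of the merged state: attribute renamed, one old reference left."""

    def __init__(self, sm_major: int):
        # Renamed attribute (from the earlier PR).
        self.is_sm100plus = sm_major >= 10
        # Block added against the old name: the attribute no longer exists,
        # so this line raises AttributeError as soon as the object is built.
        self.mtp_on_blackwell = self.use_state_pool


class FixedKernel:
    """Same toy class with the one dangling reference renamed."""

    def __init__(self, sm_major: int):
        self.is_sm100plus = sm_major >= 10
        self.mtp_on_blackwell = self.is_sm100plus


def constructs_ok(cls, sm_major: int) -> bool:
    """Return True if the class can be instantiated without AttributeError."""
    try:
        cls(sm_major)
        return True
    except AttributeError:
        return False
```

Because the two PRs touched different lines, a textual merge reports no conflict. The error only appears at construction time, which is why it showed up as a crash in `__init__` and not as a merge failure.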
[GDN] Enable FlashInfer MTP verify on SM100+ (Blackwell)
co-authored by @YAMY1234 (main contributor)
Summary
Enables FlashInfer GDN MTP (speculative decoding) verify on SM100+ (Blackwell) hardware, which previously raised `NotImplementedError`. SM90 (Hopper) MTP was already supported; this PR completes SM100+ coverage.

Root cause: `target_verify` guarded on `use_state_pool`, blocking SM100+ even though the FlashInfer `gated_delta_rule_mtp` kernel already accepts `initial_state_indices` (pool API), the same API used by the SM90 path.

Changes (2 files, ~15 lines):
- `gdn_flashinfer.py`: remove the `use_state_pool` guard in `target_verify`; unify SM90 + SM100+ into a single pool-API path; add an `A_log.detach().float()` cast (matches the SM100+ decode path, no-op on SM90).
- `server_args.py`: remove `and self.speculative_algorithm is None` from the SM100+ FlashInfer auto-default. FlashInfer is now safe to default on SM100+ regardless of whether MTP is enabled.

Accuracy (Qwen3.5-397B-A17B-NVFP4, B200)
gsm8k (TODO: examples, baseline threshold: 0.95)
GPQA Diamond (TODO: examples, repeat=8, temperature=0.6)
Throughput Benchmark (GB200, Qwen3.5-397B-A17B-NVFP4, TP=4)
Focus: long output sequence length (OSL), where per-step GDN state-update cost is most significant.
Server settings:
- `--tp-size 4 --max-running-requests 128`
- `--mamba-ssm-dtype bfloat16 --mamba-scheduler-strategy no_buffer --mamba-track-interval 128`
- `--attention-backend trtllm_mha --linear-attn-decode-backend <triton|flashinfer>`
- `--speculative-algorithm NEXTN` (MTP runs)
- `--disable-radix-cache --quantization modelopt_fp4`

Benchmark settings:
- `--dataset-name random --random-input-len 32 --random-output-len <512|1024|2048|4096>`
- `--num-prompts <varied> --request-rate inf`

Decode throughput (w/ MTP), output throughput (tok/s), ISL=32
acc len: 3.13-3.22
num_prompts: 256
Mean TPOT (ms/tok), ISL=32, OSL=512
Requirements
The traces were collected at ISL=32, OSL=512, CC=64.

FlashInfer:
Triton:
