[NVIDIA] Relax previous restraints for using flashinfer gdn decode #22818

Closed
kaixih wants to merge 1 commit into sgl-project:main from kaixih:improve_flashinfer_gdn_decode

@kaixih kaixih commented Apr 14, 2026

The combination of --linear-attn-decode-backend flashinfer and --mamba-scheduler-strategy no_buffer was previously blocked by a hard ValueError in server_args.py. The root cause was a bug in the FlashInfer bf16 CuTe-DSL kernel (SM100+/Blackwell): CUDA graph decode uses padded batches whose padding slots have initial_state_indices = -1, but the kernel had no guard for negative indices. This has since been fixed on the FlashInfer side (flashinfer-ai/flashinfer#2810, and later flashinfer-ai/flashinfer#2679), so this PR relaxes the restriction.
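To make the failure mode concrete, here is a minimal pure-Python sketch of the kind of guard described above. It is not the FlashInfer kernel: `decode_step`, `state_pool`, and the toy state update are hypothetical stand-ins, and only the `initial_state_indices = -1` padding convention comes from the description.

```python
from typing import List

PAD_INDEX = -1  # padding sentinel, as described for CUDA graph decode batches


def decode_step(
    state_pool: List[float],
    initial_state_indices: List[int],
    inputs: List[float],
) -> List[float]:
    """Toy stand-in for a decode kernel: one loop iteration per batch slot."""
    outputs = [0.0] * len(inputs)
    for slot, idx in enumerate(initial_state_indices):
        if idx < 0:
            # Padding slot: skip both the state read and the write-back.
            # Without this guard, Python would silently alias state_pool[-1];
            # a GPU kernel would instead access memory out of bounds.
            continue
        # Hypothetical recurrent update, for illustration only.
        state_pool[idx] = 0.5 * state_pool[idx] + inputs[slot]
        outputs[slot] = state_pool[idx]
    return outputs


# Batch of 2 real requests padded to 4 slots.
pool = [1.0, 2.0, 3.0]
out = decode_step(pool, [2, 0, PAD_INDEX, PAD_INDEX], [10.0, 20.0, 99.0, 99.0])
print(out)   # [11.5, 20.5, 0.0, 0.0]
print(pool)  # [20.5, 2.0, 11.5]
```

The padded slots leave the state pool untouched, which is the property the unguarded kernel could not guarantee.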

Also added tests and got the results below on 8xB200:

============================================================
Qwen3.5-397B-A17B-NVFP4 Results Summary
Dataset: gsm8k
Baseline: 0.95
============================================================

Model 1: nvidia/Qwen3.5-397B-A17B-NVFP4
  Accuracy: PASS
  Score: 0.990

Model 2: nvidia/Qwen3.5-397B-A17B-NVFP4
  Accuracy: PASS
  Score: 0.990

Model 3: nvidia/Qwen3.5-397B-A17B-NVFP4
  Accuracy: PASS
  Score: 0.995

============================================================
OVERALL: ALL TESTS PASSED
============================================================

.
----------------------------------------------------------------------
Ran 1 test in 940.944s

OK

kaixih commented Apr 14, 2026

cc. @Fridge003 @hlu1

@kaixih kaixih changed the title from "Relax previous restraints for using flashinfer gdn decode" to "[NVIDIA] Relax previous restraints for using flashinfer gdn decode" Apr 14, 2026
@kaixih kaixih requested a review from Qiaolin-Yu April 14, 2026 18:18

kaixih commented Apr 14, 2026

Sorry, #21861 has already fixed this issue.

Closing.

@kaixih kaixih closed this Apr 14, 2026