[Perf] flashinfer gdn decode kernel integration #35292
ZJY0516 wants to merge 5 commits into vllm-project:main from
Conversation
Code Review
This pull request introduces performance optimizations for the Qwen3-Next model by integrating FlashInfer kernels for Gated Delta Net (GDN) decode and Multi-Token Prediction (MTP). The changes are well-structured, adding new optimized code paths while retaining the existing implementations as fallbacks, which preserves correctness and compatibility on hardware that does not support the new kernels. The logic for selecting the appropriate kernel based on hardware capabilities and input shapes appears correct, and the state management for speculative decoding with the new MTP kernel is handled properly. Overall, the changes are a solid performance enhancement, and I have no high or critical severity concerns.
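The review describes a dispatch pattern: prefer the FlashInfer GDN decode kernel when the hardware and input shapes support it, otherwise fall back to the existing implementation. A minimal sketch of that pattern follows; the function names, compute-capability threshold, and shape constraints are illustrative assumptions, not vLLM's or FlashInfer's actual API.

```python
# Hypothetical sketch of hardware/shape-based kernel selection with a fallback
# path, as described in the review. All names and thresholds are assumptions.

MIN_COMPUTE_CAPABILITY = (9, 0)  # assumed minimum (e.g. Hopper); illustrative only


def use_flashinfer_gdn(compute_capability: tuple[int, int],
                       num_decode_tokens: int,
                       head_dim: int) -> bool:
    """Return True if the optimized FlashInfer GDN decode path applies."""
    if compute_capability < MIN_COMPUTE_CAPABILITY:
        return False  # unsupported hardware -> use fallback kernel
    if head_dim not in (64, 128):
        return False  # illustrative shape constraint on the optimized kernel
    return num_decode_tokens > 0


def gdn_decode(compute_capability: tuple[int, int],
               num_decode_tokens: int,
               head_dim: int) -> str:
    """Dispatch to the optimized kernel or the retained fallback."""
    if use_flashinfer_gdn(compute_capability, num_decode_tokens, head_dim):
        return "flashinfer_gdn_decode"
    return "fallback_gdn_decode"
```

For example, `gdn_decode((8, 0), 4, 128)` would take the fallback path, since the (assumed) capability check fails; the real selection logic in the PR will differ in its exact conditions.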
Hi, does this support full cuda graph?
decode should support full cuda graph
wait for flashinfer-ai/flashinfer#2727
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Do not review it
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.