[Perf] add cute dsl kernel for gdn decode #36111
ZJY0516 wants to merge 10 commits into vllm-project:main
Code Review
This pull request introduces a new CuTe DSL kernel for GDN decode in Qwen3Next models, aimed at improving performance. The changes are gated by a new environment variable VLLM_GDN_DECODE_BACKEND. My review focuses on the implementation and integration of this new kernel. I've identified a potential performance issue on Hopper (SM90) GPUs where a slower, scalar code path might be taken instead of the optimized intrinsic-based path. This could undermine the performance goals of this PR for a key architecture.
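For readers unfamiliar with this kind of gating, here is a minimal sketch of how an environment variable such as `VLLM_GDN_DECODE_BACKEND` could select a decode backend. The backend names (`"triton"`, `"cutedsl"`), the default, and the helper `select_gdn_decode_backend` are hypothetical and chosen only for illustration; the thread does not state the values the PR actually accepts.

```python
import os

# Hypothetical backend names: the thread only names the variable itself,
# not the values it accepts or which one is the default.
_KNOWN_BACKENDS = ("triton", "cutedsl")
_DEFAULT_BACKEND = "triton"


def select_gdn_decode_backend(environ=os.environ):
    """Return the backend named by VLLM_GDN_DECODE_BACKEND, or the default."""
    raw = environ.get("VLLM_GDN_DECODE_BACKEND", _DEFAULT_BACKEND)
    backend = raw.strip().lower()
    if backend not in _KNOWN_BACKENDS:
        raise ValueError(
            f"Unknown GDN decode backend {raw!r}; expected one of {_KNOWN_BACKENDS}"
        )
    return backend


if __name__ == "__main__":
    # Prints the default unless the variable is set in the environment.
    print(select_gdn_decode_backend())
```

Failing loudly on an unknown value, rather than silently falling back, makes a typo in the variable visible before any benchmark is run.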
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Because the GDN decode kernel accounts for a very small portion of the end-to-end runtime. FlashInfer does not accept non-contiguous states (vLLM uses non-contiguous states); after modifying it locally, it is faster than this kernel.
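To make "non-contiguous" concrete, here is a small stride check written in plain Python. The shapes are made up, and the assumption that the state batch is a strided view over a larger cache is only an illustration; the thread does not say which dimension is non-contiguous in vLLM.

```python
def contiguous_strides(shape):
    """Row-major (C-order) strides, in elements, for a dense tensor of `shape`."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True if `strides` describe a dense row-major layout for `shape`.

    Dimensions of size 1 are ignored, since their stride is never used.
    """
    expected = contiguous_strides(shape)
    return all(d == 1 or s == e for d, s, e in zip(shape, strides, expected))


# Hypothetical state cache: (slots, heads, head_k_dim, head_v_dim).
cache_shape = (8, 4, 16, 16)
cache_strides = contiguous_strides(cache_shape)   # (1024, 256, 16, 1)

# A view that takes every other slot keeps the cache strides on the inner
# dimensions but doubles the stride on the slot dimension.
view_shape = (4, 4, 16, 16)
view_strides = (2 * cache_strides[0],) + cache_strides[1:]

print(is_contiguous(cache_shape, cache_strides))  # True
print(is_contiguous(view_shape, view_strides))    # False
```

A kernel that assumes a dense layout would either need such a view copied first or need to take the strides as explicit arguments.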
FlashInfer has an optimized version of GDN. The last PR in FI landed several days ago, and the fully optimized version will be available in v0.6.6.
In the decode phase, GDN consumes around 35% for Qwen3.5-397B... Not sure why it is so small for 35B...
Let me test Qwen3.5 397B.
@vadiklyutiy I have updated the perf data for Qwen3.5 397B.
Which scenario? I cannot reproduce this.
B200
Could you clarify which dim isn't contiguous?
Did you modify FlashInfer or vLLM? If FlashInfer is faster, maybe it is worth using FlashInfer?
Comparison with FlashInfer: micro benchmark for GDN decode ops (H20, H200)
What about other batch sizes?
Not so much on my machine: 12.108s / 105.975s ≈ 11%
Using the CuTe DSL kernel, this goes from 12.108s to 8.617s, which is in line with the microbenchmark performance.
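As a quick check of what these timings imply, the sketch below computes the kernel-level speedup and the end-to-end saving. It assumes the 105.975s total quoted earlier in the thread comes from the same run and that everything outside the GDN decode kernel is unchanged; neither is confirmed in the thread.

```python
def kernel_speedup(before_s, after_s):
    """Ratio of old to new kernel time (greater than 1 means faster)."""
    return before_s / after_s


def end_to_end_saving(total_s, before_s, after_s):
    """Fraction of total runtime saved, assuming all other time is unchanged."""
    return (before_s - after_s) / total_s


# Figures quoted in the thread.
gdn_before_s = 12.108
gdn_after_s = 8.617
total_s = 105.975  # assumed to be the matching end-to-end time

speedup = kernel_speedup(gdn_before_s, gdn_after_s)
saving = end_to_end_saving(total_s, gdn_before_s, gdn_after_s)

print(f"kernel speedup: {speedup:.2f}x")
print(f"end-to-end saving: {saving:.1%}")
```

Under these assumptions the kernel is roughly 1.4x faster, while the end-to-end gain is a few percent, which matches the earlier remark that the kernel is a modest share of total runtime on this setup.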
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
I ran on B200 and, after several runs, got
Hi @ZJY0516, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
@ZJY0516: regarding #36111 (comment), I am also assuming this is STP (single token prediction)?
FP32 state, T=1 GDN decode benchmark (B200, Qwen 3.5)
bench cmd:
FI-PreTr (FlashInfer Pretranspose, FP32 State): Memory Bandwidth SOL
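A memory-bandwidth speed-of-light (SOL) figure is usually a lower bound on kernel time: bytes that must move divided by peak bandwidth. The sketch below is one rough way to estimate it for a T=1 decode step, assuming the recurrent state dominates traffic and is read once and written once. The batch size, head counts, head dimensions, and the 8 TB/s peak bandwidth are illustrative assumptions, not values taken from this benchmark.

```python
def gdn_decode_sol_seconds(
    batch,
    num_v_heads,
    head_k_dim,
    head_v_dim,
    peak_bw_bytes_per_s,
    state_bytes=4,  # FP32 state
):
    """Lower-bound time for one decode step from state traffic alone.

    Counts one read and one write of the recurrent state and ignores the
    much smaller q/k/v/gate inputs and the output.
    """
    state_elems = batch * num_v_heads * head_k_dim * head_v_dim
    bytes_moved = 2 * state_elems * state_bytes
    return bytes_moved / peak_bw_bytes_per_s


# Illustrative, assumed values (not from the benchmark above).
sol_s = gdn_decode_sol_seconds(
    batch=64,
    num_v_heads=32,
    head_k_dim=128,
    head_v_dim=128,
    peak_bw_bytes_per_s=8e12,
)
print(f"SOL: {sol_s * 1e6:.2f} us")
```

Dividing this bound by a measured kernel time gives the fraction of peak bandwidth achieved, which is a common way to compare kernels across batch sizes.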
Here are the results on B200. I am assuming the numbers you measured in #36111 (comment) are also from B200?
Closing as superseded by #36596
I think this PR makes somewhat different changes...

Purpose
Add a CuTe DSL kernel for GDN decode.
cc @ywang96 @vadiklyutiy
Test Plan
Test Result
H20
main
cutedsl kernel
Micro benchmark for GDN decode ops