[PERF] Speed-up of GDN attention decode part (Qwen3-Next) #31722
mgoin merged 3 commits into vllm-project:main
Conversation
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Code Review
This pull request speeds up the GDN attention decode operation by increasing the block size for the V dimension (BV) in the Triton kernel from 8 to 32. The provided benchmarks show significant performance improvements on H200 and B200 hardware.
While this is a great optimization for newer GPU architectures, the larger BV increases shared memory usage (roughly BK * BV * 4 bytes per block), which can cause runtime errors on GPUs with more limited shared memory per block, such as the NVIDIA T4 (Turing architecture, CC 7.5), which has a 48KB limit.
I've added a review comment with a suggestion to dynamically adjust the BV limit based on the GPU's compute capability to maintain compatibility with a wider range of hardware while still enabling the performance benefits on capable devices.
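The suggested capability-based fallback could be sketched as below. This is a minimal illustration, not code from the PR: `choose_bv` and its parameters are hypothetical names, and the thresholds only mirror the limits mentioned in the review (48KB per-block shared memory on Turing, larger budgets on Ampere and newer).

```python
def choose_bv(bk: int, cc_major: int, smem_limit_bytes: int = 48 * 1024,
              default_bv: int = 32, fallback_bv: int = 8) -> int:
    """Pick the V-dimension block size (BV) for the Triton kernel.

    The kernel's shared-memory tile is roughly BK * BV * 4 bytes, so on
    older GPUs (e.g. Turing, CC 7.5, with a 48KB per-block limit) we fall
    back to the smaller BV. On CC >= 8.0 we start from the larger BV and
    halve it only if the tile would still overflow the budget.
    """
    bv = default_bv if cc_major >= 8 else fallback_bv
    while bv > fallback_bv and bk * bv * 4 > smem_limit_bytes:
        bv //= 2
    return bv
```

In practice the compute capability would come from something like `torch.cuda.get_device_capability()` at kernel-launch time, so capable devices keep the BV=32 speed-up while older hardware silently degrades to the previous BV=8 behavior.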
cc @ZJY0516, @sighingnow

Speed Up GDN Attention (Decode): fused_recurrent_gated_delta_rule_fwd Benchmarks

H200: fused_recurrent_gated_delta_rule_fwd with shapes from Qwen3-Next
Before
After

B200: fused_recurrent_gated_delta_rule_fwd with shapes from Qwen3-Next
Before
After
End-to-End Decode
Server Launch
Benchmark Command
Results
Before
After
Speedup: 16.8%