[PERF] Speed-up of GDN attention decode part (Qwen3-Next) #31722
mgoin merged 3 commits into vllm-project:main
Conversation
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Code Review
This pull request speeds up the GDN attention decode operation by increasing the block size for the V dimension (BV) in the Triton kernel from 8 to 32. The provided benchmarks show significant performance improvements on H200 and B200 hardware.
While this is a great optimization for newer GPU architectures, the larger BV increases shared memory usage (roughly BK * BV * 4 bytes per block), which can cause runtime errors on GPUs with more limited shared memory per block, such as the NVIDIA T4 (Turing architecture, CC 7.5), which has a 48KB limit.
I've added a review comment with a suggestion to dynamically adjust the BV limit based on the GPU's compute capability to maintain compatibility with a wider range of hardware while still enabling the performance benefits on capable devices.
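The suggested capability-based fallback could be sketched as below. This is a minimal illustration, not code from the PR: `choose_bv` and its parameters are hypothetical names, and the thresholds only mirror the limits mentioned in the review (48KB per-block shared memory on Turing, larger budgets on Ampere and newer).

```python
def choose_bv(bk: int, cc_major: int, smem_limit_bytes: int = 48 * 1024,
              default_bv: int = 32, fallback_bv: int = 8) -> int:
    """Pick the V-dimension block size (BV) for the Triton kernel.

    The kernel's shared-memory tile is roughly BK * BV * 4 bytes, so on
    older GPUs (e.g. Turing, CC 7.5, with a 48KB per-block limit) we fall
    back to the smaller BV. On CC >= 8.0 we start from the larger BV and
    halve it only if the tile would still overflow the budget.
    """
    bv = default_bv if cc_major >= 8 else fallback_bv
    while bv > fallback_bv and bk * bv * 4 > smem_limit_bytes:
        bv //= 2
    return bv
```

In practice the compute capability would come from something like `torch.cuda.get_device_capability()` at kernel-launch time, so capable devices keep the BV=32 speed-up while older hardware silently degrades to the previous BV=8 behavior.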
cc @ZJY0516, @sighingnow

Speed Up GDN Attention (Decode): fused_recurrent_gated_delta_rule_fwd Benchmarks

H200: fused_recurrent_gated_delta_rule_fwd with shapes from Qwen3-Next
Before
After

B200: fused_recurrent_gated_delta_rule_fwd with shapes from Qwen3-Next
Before
After
End-to-End Decode
Server Launch
Benchmark Command
Results
Before
After
Speedup: 16.8%