
[PERF] Speed-up of GDN attention decode part (Qwen3-Next)#31722

Merged
mgoin merged 3 commits into vllm-project:main from CentML:vadim/speedup-gdn-dec
Jan 6, 2026

Conversation

@vadiklyutiy
Collaborator

@vadiklyutiy vadiklyutiy commented Jan 5, 2026

Speed Up GDN Attention (Decode) - fused_recurrent_gated_delta_rule_fwd

Benchmarks

H200 — fused_recurrent_gated_delta_rule_fwd with shapes from Qwen3-Next

Before

| Batch Size | TP=4 (μs) | TP=2 (μs) |
|---|---|---|
| 32 | 19.87 ± 0.08 | 30.14 ± 0.02 |
| 128 | 52.06 ± 0.02 | 93.38 ± 0.02 |
| 256 | 94.24 ± 0.02 | 176.10 ± 0.02 |
| 512 | 177.73 ± 0.03 | 341.63 ± 0.03 |
| 1024 | 345.12 ± 0.20 | 672.48 ± 0.05 |

After

| Batch Size | TP=4 (μs) | TP=2 (μs) |
|---|---|---|
| 32 | 13.60 ± 0.07 | 20.29 ± 0.02 |
| 128 | 33.76 ± 0.02 | 59.17 ± 0.02 |
| 256 | 60.29 ± 0.02 | 111.14 ± 0.36 |
| 512 | 113.28 ± 0.05 | 212.70 ± 0.11 |
| 1024 | 215.12 ± 0.13 | 420.22 ± 0.20 |

B200 — fused_recurrent_gated_delta_rule_fwd with shapes from Qwen3-Next

Before

| Batch Size | TP=4 (μs) | TP=2 (μs) |
|---|---|---|
| 32 | 12.26 ± 0.02 | 19.23 ± 0.01 |
| 128 | 34.91 ± 0.01 | 63.58 ± 0.01 |
| 256 | 64.80 ± 0.01 | 122.30 ± 0.01 |
| 512 | 124.64 ± 0.02 | 239.23 ± 0.02 |
| 1024 | 244.60 ± 0.02 | 473.53 ± 0.05 |

After

| Batch Size | TP=4 (μs) | TP=2 (μs) |
|---|---|---|
| 32 | 8.22 ± 0.02 | 14.11 ± 0.01 |
| 128 | 24.54 ± 0.01 | 42.21 ± 0.01 |
| 256 | 42.11 ± 0.02 | 80.29 ± 0.02 |
| 512 | 80.80 ± 0.03 | 156.64 ± 0.06 |
| 1024 | 154.94 ± 0.07 | 305.21 ± 0.21 |
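As a quick sanity check on the tables above, the per-kernel speedup works out to roughly 1.42-1.6× across batch sizes. A minimal sketch (the TP=4 timing values are copied directly from the tables):

```python
# Kernel times in microseconds, TP=4 column of the benchmark tables above.
h200_before = {32: 19.87, 128: 52.06, 256: 94.24, 512: 177.73, 1024: 345.12}
h200_after  = {32: 13.60, 128: 33.76, 256: 60.29, 512: 113.28, 1024: 215.12}
b200_before = {32: 12.26, 128: 34.91, 256: 64.80, 512: 124.64, 1024: 244.60}
b200_after  = {32: 8.22, 128: 24.54, 256: 42.11, 512: 80.80, 1024: 154.94}

def speedups(before: dict, after: dict) -> dict:
    # Ratio of old kernel time to new kernel time, per batch size.
    return {bs: before[bs] / after[bs] for bs in before}

h200 = speedups(h200_before, h200_after)
b200 = speedups(b200_before, b200_after)
```

The gain grows slightly with batch size on H200 (about 1.46× at batch 32 up to 1.60× at batch 1024).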

End-to-End Decode

Server Launch

VLLM_USE_FLASHINFER_MOE_FP8=1 \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  -tp 2 \
  --enable-expert-parallel \
  --async-scheduling \
  --no-enable-prefix-caching \
  --compilation_config.max_cudagraph_capture_size 2048

Benchmark Command

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input 32 \
  --random-output 1000 \
  --max-concurrency 1024 \
  --num-prompt 1024 \
  --ignore-eos

Results

Before

  • Output token throughput: 23891 tok/s

After

  • Output token throughput: 27904 tok/s

Speedup: 16.8%
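The quoted figure follows directly from the two throughput numbers:

```python
before_tps = 23891  # output tok/s, baseline
after_tps = 27904   # output tok/s, with the larger BV block size

# Relative end-to-end decode speedup in percent.
speedup_pct = (after_tps / before_tps - 1) * 100
print(f"{speedup_pct:.1f}%")  # 16.8%
```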

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy vadiklyutiy requested a review from youkaichao January 5, 2026 12:38
@mergify mergify bot added the qwen (Related to Qwen models) label Jan 5, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to speed up the GDN attention decode operation by increasing the block size for the V dimension (BV) in the Triton kernel from 8 to 32. The provided benchmarks show significant performance improvements on H200 and B200 hardware.

While this is a great optimization for newer GPU architectures, the increased BV can lead to excessive shared memory usage (BK * BV * 4 bytes), potentially causing runtime errors on GPUs with more limited shared memory per block, such as the NVIDIA T4 (Turing architecture, CC 7.5), which has a 48KB limit.

I've added a review comment with a suggestion to dynamically adjust the BV limit based on the GPU's compute capability to maintain compatibility with a wider range of hardware while still enabling the performance benefits on capable devices.
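One way to implement the suggested fallback is to pick the largest BV whose fp32 tile fits in the per-block shared-memory budget. This is a hypothetical sketch, not the actual vLLM code: the function name `pick_bv` and the candidate list are illustrative, and on a real device the budget would be queried from the driver (e.g. via `torch.cuda.get_device_properties`).

```python
def pick_bv(bk: int, shared_mem_limit_bytes: int,
            candidates: tuple = (32, 16, 8)) -> int:
    """Return the largest candidate BV whose fp32 tile (BK * BV * 4 bytes)
    fits within the per-block shared-memory budget."""
    for bv in candidates:
        if bk * bv * 4 <= shared_mem_limit_bytes:
            return bv
    return candidates[-1]  # smallest candidate as a last resort

# With a 48 KB per-block budget (e.g. a T4-class GPU), a small BK keeps
# BV=32, while a larger BK falls back to a smaller block.
print(pick_bv(128, 48 * 1024))  # 32
print(pick_bv(512, 48 * 1024))  # 16
```

On H100/B200-class parts, with much larger shared-memory budgets, the same logic would simply select BV=32 everywhere, preserving the speedup this PR measures.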

@vadiklyutiy
Collaborator Author

vadiklyutiy commented Jan 5, 2026

cc @ZJY0516, @sighingnow

@vadiklyutiy vadiklyutiy requested a review from sighingnow January 5, 2026 12:43
@vadiklyutiy vadiklyutiy self-assigned this Jan 5, 2026
Member

@ZJY0516 ZJY0516 left a comment


LGTM

@mgoin mgoin added the performance (Performance-related issues) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Jan 5, 2026
@mgoin mgoin enabled auto-merge (squash) January 5, 2026 20:11
@vadiklyutiy
Collaborator Author

It seems `multi-modal-processor-test-cpu` also fails on top-of-tree.


@mgoin mgoin merged commit 22dffca into vllm-project:main Jan 6, 2026
47 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
…ct#31722)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
…ct#31722)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…ct#31722)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…ct#31722)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy vadiklyutiy deleted the vadim/speedup-gdn-dec branch March 11, 2026 08:01

Labels

- performance: Performance-related issues
- qwen: Related to Qwen models
- ready: ONLY add when PR is ready to merge/full CI is needed


3 participants