[PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] #33291
Conversation
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Code Review
This pull request introduces a performance optimization by changing the memory layout of the GDN attention state from [N, HV, K, V] to [N, HV, V, K], aimed at improving memory access patterns and throughput. The modifications are applied consistently across documentation, examples, and Triton kernel implementations, and the kernel logic has been correctly adapted to the new layout, including transpositions where necessary. The provided performance benchmarks and correctness verification results support the effectiveness and validity of this change.
cc @ZJY0516
This pull request has merge conflicts that must be resolved before it can be merged.
pavanimajety left a comment
Could we do a perf comparison against multiple batch sizes? Also, does this change naturally work for spec decode too?
```diff
  b_v = b_v.to(k.dtype.element_ty)

  p_k = tl.make_block_ptr(
      k, (K, T), (1, stride_k), (0, i_t * BT), (64, BT), (0, 1)
  )
  b_k = tl.load(p_k, boundary_check=(0, 1))
- b_h1 += tl.dot(b_k, b_v)
+ b_h1 += tl.trans(tl.dot(b_k, b_v))
```
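For context, a minimal PyTorch sketch (not the actual Triton code) of the algebra behind the added transpose: with the state stored in the new [V, K] layout, the same rank update is simply written transposed. The tile sizes below are hypothetical.

```python
import torch

K_dim, V_dim, BT = 64, 32, 16  # hypothetical tile sizes
b_k = torch.randn(K_dim, BT)   # key tile loaded as (K, T), as in the block ptr above
b_v = torch.randn(BT, V_dim)   # value tile, (T, V)

update_kv = b_k @ b_v          # old [K, V] state layout: update is (K, V)
update_vk = (b_k @ b_v).T      # new [V, K] layout: the same update, stored transposed
assert torch.allclose(update_vk, update_kv.T)
```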
Do we see as much speedup for larger batch sizes too with these additional transposes?
What do you mean? In the description I reported results for batch=1024.
Thanks for the clarification, I misread it as num-prompt 32. IMO we should still have performance numbers across a range of batch sizes.
I added a comparison with several additional batch sizes to the description.
```diff
@@ -55,7 +55,7 @@ def fused_recurrent_kda_fwd(
     if inplace_final_state:
         final_state = initial_state
```
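For reference, a minimal sketch of what this branch does; only `inplace_final_state`, `initial_state`, and `final_state` appear in the diff, everything else is an assumption for illustration.

```python
import torch

def fused_recurrent_kda_fwd_sketch(
    initial_state: torch.Tensor,       # recurrent state in the [N, HV, V, K] layout
    inplace_final_state: bool = True,
) -> torch.Tensor:
    if inplace_final_state:
        # Alias the buffer: the kernel overwrites the initial state in place,
        # avoiding a second [N, HV, V, K] allocation.
        final_state = initial_state
    else:
        final_state = torch.empty_like(initial_state)
    # ... kernel launch would write the updated state into final_state here ...
    return final_state
```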
Does the ssm_state / kv_cache also need to be created in the [N, HV, V, K] layout?
This layout change is exactly the layout of the ssm_state/kv_cache.
Did I miss some place where I should change it?
Just checking whether anything needs to change when the kv_cache is initially created. If the current setup yields good accuracy, it should be fine. Let's double-check with spec decode since it’s currently supported.
I ran spec decoding and added the results to the description. No accuracy is lost with spec decoding.
Force-pushed from c9ecabd to 6db5d35.
Thanks for your contributions, overall LGTM.
It indirectly impacts prefill. See #32846, which tries to enable FlashInfer prefill; that kernel uses the new layout.
This is fluctuation. If you make several runs of vllm bench you will see 1-2% fluctuations.
youkaichao left a comment
thanks for the contribution, LGTM 👍 since @zhiyuan1i agrees
Summary
This PR changes the recurrent state memory layout in GDN (Gated Delta Net) attention from [N, HV, K, V] to [N, HV, V, K] for improved memory access patterns and throughput. Besides the speedup, it also allows using FlashInfer's GDN kernels.
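A minimal sketch of what the layout change means for the state tensor; the sizes below are hypothetical (N = batch, HV = value heads, K/V = head dims).

```python
import torch

N, HV, K, V = 4, 16, 128, 128  # hypothetical sizes

# Old layout: each per-head recurrent state is a (K, V) matrix.
state_old = torch.zeros(N, HV, K, V)

# New layout: the same state stored transposed as (V, K), so the innermost
# (contiguous) dimension is now K.
state_new = state_old.transpose(-1, -2).contiguous()
assert state_new.shape == (N, HV, V, K)
```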
Performance Results
Model: `nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4` (TP=2)

Server:

```
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
  -tp 2 --enable-expert-parallel --async-scheduling --no-enable-prefix-caching \
  --compilation_config.max_cudagraph_capture_size 2048
```

Benchmark:
Correctness Verification (lm_eval)
Task: GSM8K (5-shot)

Model: `nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4`

Server:

Evaluation:

```
lm_eval --model local-chat-completions \
  --model_args model=nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=250 \
  --tasks gsm8k --apply_chat_template --num_fewshot 5 --output_path ./eval_results --log_samples
```

With speculative decoding
Unfortunately, we have a problem in the spec decoding + cudagraph case, so this run was done without cudagraph. It also used `local-completions` instead of the `local-chat-completions` above, which produces better accuracy.

Server:

Evaluation:

```
lm_eval --model local-completions --tasks gsm8k \
  --model_args base_url=http://localhost:8000/v1/completions,model=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8,num_concurrent=109
```

The result is the same as baseline.