
[Perf] add packed recurrent fast path for decode#36596

Merged
vllm-bot merged 9 commits into vllm-project:main from caozuoba:perf/gdn-packed
Mar 12, 2026

Conversation

@caozuoba
Contributor

@caozuoba caozuoba commented Mar 10, 2026

Purpose

  • Add a packed recurrent decode fast path for Qwen3Next GDN non-spec uniform decode (T=1).
  • Directly consume packed mixed_qkv in a decode-only fast path instead of materializing contiguous q/k/v.
  • Fuse the packed data flow with recurrent decode state update/output writeback to reduce intermediate memory traffic in the decode hot path.
  • Guard the fast path with VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE (default: 0).

Test Result

Correctness (pytest)

Command

python3 -m pytest -q tests/kernels/test_fused_recurrent_packed_decode.py

Result

....                                                                   [100%]
4 passed in 8.55s

Performance

Compared to the main branch, this PR improves Output token throughput (tok/s) by ~6.37%, reduces Mean TPOT (ms) by ~5.34%, reduces Mean E2EL (ms) by ~6.56%, and reduces Mean TTFT (ms) by ~9.79%.

Command

python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 19000 \
  --dtype bfloat16 \
  --model /nas/disk1/Qwen3.5-35B-A3B \
  --served-model-name Qwen3.5-35B-A3B \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --no-enable-prefix-caching
Main:
============ Serving Benchmark Result ============
Successful requests:                     800
Failed requests:                         0
Benchmark duration (s):                  9.37
Total input tokens:                      102400
Total generated tokens:                  80000
Request throughput (req/s):              85.34
Output token throughput (tok/s):         8533.93
Peak output token throughput (tok/s):    13948.00
Peak concurrent requests:                800.00
Total token throughput (tok/s):          19457.36
---------------Time to First Token----------------
Mean TTFT (ms):                          2289.22
Median TTFT (ms):                        2349.07
P99 TTFT (ms):                           8254.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          62.00
Median TPOT (ms):                        61.78
P99 TPOT (ms):                           69.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           61.63
Median ITL (ms):                         56.47
P99 ITL (ms):                            343.68
----------------End-to-end Latency----------------
Mean E2EL (ms):                          8427.53
Median E2EL (ms):                        8461.47
P99 E2EL (ms):                           9300.76
==================================================
PR:
============ Serving Benchmark Result ============
Successful requests:                     800
Failed requests:                         0
Benchmark duration (s):                  8.81
Total input tokens:                      102400
Total generated tokens:                  80000
Request throughput (req/s):              90.78
Output token throughput (tok/s):         9077.62
Peak output token throughput (tok/s):    14744.00
Peak concurrent requests:                800.00
Total token throughput (tok/s):          20696.97
---------------Time to First Token----------------
Mean TTFT (ms):                          2064.99
Median TTFT (ms):                        1807.52
P99 TTFT (ms):                           7798.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.69
Median TPOT (ms):                        61.33
P99 TPOT (ms):                           66.13
---------------Inter-token Latency----------------
Mean ITL (ms):                           58.25
Median ITL (ms):                         53.12
P99 ITL (ms):                            310.54
----------------End-to-end Latency----------------
Mean E2EL (ms):                          7874.99
Median E2EL (ms):                        7882.78
P99 E2EL (ms):                           8737.92
==================================================
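For reference, the headline percentages quoted above can be recomputed from the two result blocks (metric values copied verbatim from the main and PR runs):

```python
# Mean metrics from the two serving benchmark blocks above.
main = {"tok_s": 8533.93, "tpot": 62.00, "e2el": 8427.53, "ttft": 2289.22}
pr   = {"tok_s": 9077.62, "tpot": 58.69, "e2el": 7874.99, "ttft": 2064.99}

# Throughput: higher is better, so report the relative gain.
throughput_gain = (pr["tok_s"] / main["tok_s"] - 1) * 100
# Latencies: lower is better, so report the relative reduction.
tpot_reduction = (main["tpot"] - pr["tpot"]) / main["tpot"] * 100
e2el_reduction = (main["e2el"] - pr["e2el"]) / main["e2el"] * 100
ttft_reduction = (main["ttft"] - pr["ttft"]) / main["ttft"] * 100

print(f"tok/s +{throughput_gain:.2f}%, TPOT -{tpot_reduction:.2f}%, "
      f"E2EL -{e2el_reduction:.2f}%, TTFT -{ttft_reduction:.2f}%")
```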

Signed-off-by: hdj <1293066020@qq.com>
@mergify mergify bot added the qwen Related to Qwen models label Mar 10, 2026
@mergify
Contributor

mergify bot commented Mar 10, 2026

Hi @caozuoba, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance optimization for Qwen3Next models by adding a packed recurrent fast path for the decode phase. The changes are well-structured and include a new Triton kernel that directly consumes a packed QKV tensor, which avoids materializing separate Q, K, and V tensors and fuses the gating logic. This new functionality is controlled by an environment variable and includes robust fallback mechanisms to the baseline implementation, ensuring safety. The pull request also adds a new test to verify the correctness of the new kernel. The integration into the existing model code is clean and follows established patterns. Overall, the changes appear to be correct, safe, and a valuable performance enhancement.

Signed-off-by: hdj <1293066020@qq.com>
@caozuoba
Contributor Author

@mgoin @tlrmchlsmth @WoosukKwon @ZJY0516 Hello everyone, could you please take a look at this PR and provide some feedback when you have a moment? This is a follow-up to PR #35739, submitted last week. Due to conflicts with an already merged PR, I've created a new PR and re-run the benchmarks. Thanks!

Member

@ZJY0516 ZJY0516 left a comment


please add accuracy test

tl.store(p_ht, b_h.to(p_ht.dtype.element_ty), mask=mask_h)


def fused_recurrent_gated_delta_rule_packed_decode_fwd(
Member


Suggested change
def fused_recurrent_gated_delta_rule_packed_decode_fwd(
def fused_recurrent_gated_delta_rule_packed_decode(

I prefer this name

use_qk_l2norm_in_kernel=True,
)
return
except ValueError as exc:
Member


When this will fail? I don't think this needs try and except here

else:
core_attn_out[:num_actual_tokens] = core_attn_out_non_spec.squeeze(0)

def _forward_core_packed(
Member


Suggested change
def _forward_core_packed(
def _forward_core_decode_non_spec(

vllm/envs.py Outdated
if "VLLM_DISABLED_KERNELS" not in os.environ
else os.environ["VLLM_DISABLED_KERNELS"].split(","),
"VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE": lambda: bool(
int(os.getenv("VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE", "0"))
Member


Could we enable this by default?

Signed-off-by: hdj <1293066020@qq.com>
@caozuoba
Contributor Author

Accuracy Testing

Command

python3 -m lm_eval --model local-completions \
  --model_args model=Qwen3.5-35B-A3B,base_url=http://127.0.0.1:19000/v1/completions,num_concurrent=80,tokenizer=/nas/disk1/Qwen3.5-35B-A3B \
  --tasks gsm8k

Baseline

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8529|±  |0.0098|
|     |       |strict-match    |     5|exact_match|↑  |0.8370|±  |0.0102|

This PR

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8582|±  |0.0096|
|     |       |strict-match    |     5|exact_match|↑  |0.8431|±  |0.0100|
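As a quick sanity check (a sketch, not part of the PR), both gsm8k deltas fall well within the combined standard errors of the two runs, so the fast path shows no measurable accuracy change:

```python
import math

def within_noise(base: float, base_se: float, pr: float, pr_se: float) -> bool:
    # Compare the accuracy delta against the combined stderr of the two runs.
    return abs(pr - base) < math.hypot(base_se, pr_se)

# Values copied from the baseline and PR tables above.
assert within_noise(0.8529, 0.0098, 0.8582, 0.0096)  # flexible-extract
assert within_noise(0.8370, 0.0102, 0.8431, 0.0100)  # strict-match
```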

@caozuoba
Contributor Author

@ZJY0516 Hi, I’ve addressed the review comments and added an accuracy test.

When you have time, could you please take another look and share your thoughts on the next step?

Member

@ZJY0516 ZJY0516 left a comment


LGTM now

else:
core_attn_out[:num_actual_tokens] = core_attn_out_non_spec.squeeze(0)

def _forward_core_decode_non_spec(
Member


Can we switch to this inside _forward_core instead of falling back inside this decode method?

Contributor Author


Can we switch to this inside _forward_core instead of falling back inside this decode method?

Good point, thanks. Moving this selection into _forward_core does make the flow cleaner.

Let me update that.

Contributor Author


Can we switch to this inside _forward_core instead of falling back inside this decode method?

Updated accordingly, thanks!

@ZJY0516
Member

ZJY0516 commented Mar 11, 2026

also cc @vadiklyutiy

Signed-off-by: hdj <1293066020@qq.com>
attn_metadata = attn_metadata[self.prefix]
assert isinstance(attn_metadata, GDNAttentionMetadata)

if (
Member


I prefer this:

if is_decode:
    return self._forward_core_decode_non_spec

# original logic here

No need to introduce _forward_core_baseline

Contributor Author


I prefer this:

if is_decode:
    return self._forward_core_decode_non_spec

# original logic here

No need to introduce _forward_core_baseline

Makes sense — updated accordingly. Thanks.

Signed-off-by: hdj <1293066020@qq.com>
Member

@ZJY0516 ZJY0516 left a comment


Thanks for contributing

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 12, 2026
@caozuoba
Contributor Author

@ZJY0516 Thanks again for the review and approval! I also have a draft follow-up PR for the spec path, and I’m planning to submit it soon. I’d really appreciate it if you could take a look when you have time.

@caozuoba
Contributor Author

@ywang96 @ZJY0516 It looks like some of the failed checks are due to 403 / permission issues rather than the code change itself. When you have time, could you please rerun those failed checks from the maintainer side?

@caozuoba
Contributor Author

It looks like one check is still consistently hitting a 403. Do you happen to know what might be causing that, or if there’s anything I should do on my side? @ZJY0516 @ywang96

Member

@mgoin mgoin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I just don't see the need for an env var

Comment on lines +903 to +905
"VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE": lambda: bool(
int(os.getenv("VLLM_ENABLE_FLA_PACKED_RECURRENT_DECODE", "1"))
),
Member


Why do we need this env var at all if it is enabled by default?

Member


Yes, no need to add this

Contributor Author


Why do we need this env var at all if it is enabled by default?

Good point. I plan to clean this up together with the spec-path follow-up so both paths stay consistent.

@caozuoba
Contributor Author

LGTM, I just don't see the need for an env var

@mgoin When you have time, could you please help retrigger that failed CI check? It seems to have hit a 403 a few times already.

@mgoin
Member

mgoin commented Mar 12, 2026

We can just force merge for now

@vllm-bot vllm-bot merged commit 9e19f83 into vllm-project:main Mar 12, 2026
56 of 58 checks passed