[Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 #37975

jikunshang merged 12 commits into vllm-project:main from …
Conversation
Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.

Since non-contiguous key/value tensors are not supported on XPU and NPU, the operators of the … For in-tree platform dispatch, I currently do not have a good solution. For out-of-tree platform dispatch, the … cc @ZJY0516
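The contiguity constraint mentioned above can be sketched as a small guard in plain PyTorch. `ensure_contiguous_kv` is a hypothetical helper for illustration, not vLLM code: backends that cannot handle strided key/value layouts need a contiguous copy before the kernel runs.

```python
import torch

def ensure_contiguous_kv(
    key: torch.Tensor, value: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    # Hypothetical helper: platforms that reject strided key/value
    # layouts (the XPU/NPU concern above) need a contiguous copy.
    if not key.is_contiguous():
        key = key.contiguous()
    if not value.is_contiguous():
        value = value.contiguous()
    return key, value
```

A transposed view such as `torch.randn(4, 8).t()` is non-contiguous and would be copied; an already-contiguous tensor passes through unchanged.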
This is a very important and valuable change! The GDN implementation was quite messy; thank you very much for your contribution. Will take a look later today.
/gemini review

Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.

/gemini review

@codex review
Code Review
The Gated Delta Net (GDN) attention implementation, including its custom operations and Triton kernels, has been refactored into a new dedicated file, gdn_linear_attn.py. This new GatedDeltaNetAttention class now serves as a unified implementation for both Qwen3-Next and Qwen3.5 models, replacing the previously separate GDN classes and handling model-specific configurations such as GQA interleaved layouts and LoRA compatibility through parameters.

A critical issue was identified in the fix_query_key_value_ordering method, where new_tensor_shape_ba is incorrectly derived from mixed_qkvz.size() instead of mixed_ba.size(), which could lead to a runtime error if the number of tokens differs between these tensors.
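The flagged shape bug can be illustrated with a minimal stand-alone sketch (hypothetical function and shapes, not the actual vLLM method): each tensor's new shape must be derived from its *own* `.size()`, not the other tensor's.

```python
import torch

def fix_query_key_value_ordering(
    mixed_qkvz: torch.Tensor, mixed_ba: torch.Tensor, num_heads: int
) -> tuple[torch.Tensor, torch.Tensor]:
    """Sketch of the corrected pattern reviewed above."""
    # Buggy pattern would be:
    #   new_tensor_shape_ba = mixed_qkvz.size()[:-1] + (...)
    # which breaks whenever the two tensors disagree in leading dims.
    new_tensor_shape_qkvz = mixed_qkvz.size()[:-1] + (
        num_heads, mixed_qkvz.size(-1) // num_heads,
    )
    # Correct: use mixed_ba's own leading dimensions.
    new_tensor_shape_ba = mixed_ba.size()[:-1] + (
        num_heads, mixed_ba.size(-1) // num_heads,
    )
    return (
        mixed_qkvz.view(*new_tensor_shape_qkvz),
        mixed_ba.view(*new_tensor_shape_ba),
    )
```

With two heads, a `(5, 8)` qkvz tensor reshapes to `(5, 2, 4)` while a `(5, 4)` ba tensor reshapes to `(5, 2, 2)`; deriving the latter from the former's size would fail the `view`.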
Yes, we have needed this for a long time.

Please test Qwen3.5, Qwen3-Next, and LoRA.

OK, I will add it.

I tested this refactor on the XPU platform; Qwen3.5-9B shows an accuracy issue. Can you check?

I just ran a simple test on an A100 and the output is normal. I don't have an XPU machine for testing. What are your test cases and outputs? Did you run an accuracy test?
jikunshang left a comment

Thanks for refactoring. Just some minor comments from my side.
```python
        use_qk_l2norm_in_kernel=use_qk_l2norm_in_kernel,
    )

    def forward_native(
```
Ideally, forward_native should be a torch-native impl so every platform could leverage it; using Triton here means the CPU platform will throw an error. I am OK to keep this, just a minor concern. cc @bigPYJ1151
I don't think we need a torch-native impl, just as there is no torch-native flash attention in vLLM.
Agree we don't need it here. It's just about naming; maybe we should rename it to forward_triton to avoid confusion.
My understanding is that forward_native in CustomOp should be a torch-native impl. https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/custom_op.py#L138-L144
I agree that forward_native should be a torch-native implementation, so using Triton here is not reasonable. However, CustomOp provides platform-specific forward dispatch ( https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/custom_op.py#L196-L207 ), which forward_triton would not have. I think the best practice is the vLLM IR proposed by Luka ( #32358 ): we can define the Triton kernel as an IR kernel and specify its platform-wide usage.
So we can wait for the IR PR to be merged and then refactor the code. Is that acceptable?
I think we can do this later; it doesn't necessarily have to be done in this PR.
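For context, the CustomOp convention being discussed can be sketched roughly as follows. This is a simplified stand-in, not vLLM's actual class (which dispatches via `current_platform`): `forward` picks a backend per platform, and `forward_native` is the pure-torch fallback that must run everywhere, including CPU, which is why a Triton-backed `forward_native` violates the convention.

```python
import torch

class CustomOpSketch:
    """Simplified sketch of the CustomOp dispatch convention."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hypothetical platform probe; vLLM consults current_platform.
        if torch.cuda.is_available():
            return self.forward_cuda(x)
        return self.forward_native(x)

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Pure torch implementation: runs on any platform, CPU included.
        return torch.nn.functional.silu(x)

    def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
        # Platform-specific path; in practice often a Triton/CUDA kernel.
        return self.forward_native(x)
```

Under this convention, renaming a Triton-only path to something like forward_triton would lose the dispatch hook, which is the trade-off raised above.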
Thanks for checking. Then it is probably a platform-specific issue on our side. Let me take a look further.
This pull request has merge conflicts that must be resolved before it can be merged.
…t and Qwen3.5 Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
Hi @wxsIcey, the pre-commit checks have failed. Please run:

```
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

I triggered …

Thank you. All tests have passed. Is it possible to get approval?

@jikunshang PTAL again.
@claude review |
A few things still need attention before this can merge: there is an unresolved reviewer question on gdn_linear_attn.py, pre-commit was failing as of March 26 (unclear if fixed), and the gdn_in_proj stub calls an undefined _forward_in_proj method (see inline comment).
Extended reasoning...
Overview
This PR extracts GatedDeltaNetAttention from qwen3_next.py into a new shared gdn_linear_attn.py and removes the Qwen3NextGatedDeltaNet/Qwen3_5GatedDeltaNet subclass hierarchy in favour of a single parameterised class. It touches ~1000 lines across three files, primarily a refactor but with meaningful behavioural changes.
Security Risks
No security-sensitive code paths are touched. Risk is limited to model correctness and inference performance.
Level of Scrutiny
The change is architecturally meaningful as the foundation for out-of-tree platform dispatch unifying two previously divergent implementations. LM Eval on B200 passed, but several reviewer concerns remain open. This warrants human approval rather than bot shadow-approval.
Other Factors
- Unresolved reviewer comment - The inline question on gdn_linear_attn.py (comment id 2981910661, "is this necessary?") is not marked resolved in the timeline.
- Pre-commit failure - Mergify reported a pre-commit failure on March 26; it is not confirmed whether this was subsequently fixed.
- gdn_in_proj dead code - The stub function at line 950 calls self._forward_in_proj which does not exist on GatedDeltaNetAttention. While currently unreachable dead code, it is a footgun for future platform-plugin authors (see inline comment for details).
- forward_native naming - ChunkGatedDeltaRule.forward_native actually invokes a Triton/FLA kernel, violating the CustomOp convention. The team agreed to defer this to a follow-up.
```python
    layer_name: str,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Custom op for the input projection.
    """
    forward_context: ForwardContext = get_forward_context()
    self = forward_context.no_compile_layers[layer_name]
    return self._forward_in_proj(hidden_states)


def gdn_attention_core(
    mixed_qkv: torch.Tensor,
```
🟡 The gdn_in_proj function (line 950) calls self._forward_in_proj(hidden_states), but GatedDeltaNetAttention defines no _forward_in_proj method — this would raise AttributeError if invoked. The function is also never registered via direct_register_custom_op (unlike gdn_attention_core), making it unreachable dead code. The _qkvz_output_size helper (with docstring "for gdn_in_proj fake impl") is similarly orphaned and should be removed or completed.
Extended reasoning...
The gdn_in_proj function defined at lines 946–957 of vllm/model_executor/layers/mamba/gdn_linear_attn.py is an incomplete custom-op stub left over from the refactoring. It retrieves the layer from the forward context and then calls self._forward_in_proj(hidden_states), but GatedDeltaNetAttention (and none of its base classes — PluggableLayer, MambaBase) defines any _forward_in_proj method. A grep of the entire codebase confirms only one occurrence of _forward_in_proj: the call site at line 957 itself.
The companion _qkvz_output_size method at line 668 has a docstring that explicitly reads "for gdn_in_proj fake impl", confirming that the author intended this to become a full custom op (analogous to the working gdn_attention_core op), with a real implementation and a fake/shape-only implementation for torch.compile. That work was never completed: direct_register_custom_op is called only for gdn_attention_core, never for gdn_in_proj.
As things stand, gdn_in_proj is unreachable dead code. No code path in the codebase calls it directly, and it is not registered as a torch.ops.vllm.* custom op that could be dispatched to. So there is no runtime failure today.
However, the purpose of PluggableLayer is precisely to allow out-of-tree platform plugins (e.g., XPU, NPU) to register alternative forward implementations. If a plugin author discovers gdn_in_proj and tries to wire it up as a custom op for the input-projection step, the stub will crash with AttributeError: GatedDeltaNetAttention object has no attribute _forward_in_proj the first time it is called. This makes the code a footgun for future contributors and platform integrators.
Step-by-step proof:
- A platform plugin registers `gdn_in_proj` via `direct_register_custom_op`, pointing at `gdn_in_proj`.
- The plugin arranges for the forward pass to call `torch.ops.vllm.gdn_in_proj(hidden_states, qkvz_size, ba_size, self.prefix)`.
- `gdn_in_proj` runs, resolves `self` from the forward context, then executes `self._forward_in_proj(hidden_states)`.
- Python raises `AttributeError: GatedDeltaNetAttention object has no attribute _forward_in_proj`.
Fix: Either (a) implement _forward_in_proj on GatedDeltaNetAttention and register gdn_in_proj with direct_register_custom_op plus a gdn_in_proj_fake shape function, or (b) remove gdn_in_proj and _qkvz_output_size entirely if the custom-op abstraction for input projection is not needed in this PR.
Thanks for the review. I forgot to remove the unnecessary gdn_in_proj when resolving the conflict at #38152. It has been fixed.
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
@claude review |
…t and Qwen3.5 (vllm-project#37975) Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Icey <1790571317@qq.com>
…t and Qwen3.5 (vllm-project#37975) Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…t and Qwen3.5 (vllm-project#37975) Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
Purpose
Move the GDN (Gated Delta Net) layer implementation from qwen3_next.py into a dedicated gdn_linear_attn.py, and unify Qwen3Next and Qwen3.5 under a single GatedDeltaNetAttention class.

Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.