[Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 #37975

jikunshang merged 12 commits into vllm-project:main from …
Conversation
Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.

Since non-contiguous key/value tensors are not supported on XPU and NPU, the operators of the … For in-tree platform dispatch, I currently do not have a good solution. For out-of-tree platform dispatch, the … cc @ZJY0516
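The contiguity constraint mentioned above can be sketched as a small guard in plain PyTorch. `ensure_contiguous_kv` is a hypothetical helper for illustration, not vLLM code: backends that cannot handle strided key/value layouts need a contiguous copy before the kernel runs.

```python
import torch

def ensure_contiguous_kv(
    key: torch.Tensor, value: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]:
    # Hypothetical helper: platforms that reject strided key/value
    # layouts (the XPU/NPU concern above) need a contiguous copy.
    if not key.is_contiguous():
        key = key.contiguous()
    if not value.is_contiguous():
        value = value.contiguous()
    return key, value
```

A transposed view such as `torch.randn(4, 8).t()` is non-contiguous and would be copied; an already-contiguous tensor passes through unchanged.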
This is a very important and valuable change! The GDN implementation was quite messy; thank you very much for your contribution. Will take a look later today.
/gemini review

Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.

/gemini review

@codex review
Code Review
The Gated Delta Net (GDN) attention implementation, including its custom operations and Triton kernels, has been refactored into a new dedicated file, gdn_linear_attn.py. This new GatedDeltaNetAttention class now serves as a unified implementation for both Qwen3-Next and Qwen3.5 models, replacing the previously separate GDN classes and handling model-specific configurations such as GQA interleaved layouts and LoRA compatibility through parameters.

A critical issue was identified in the fix_query_key_value_ordering method, where new_tensor_shape_ba is incorrectly derived from mixed_qkvz.size() instead of mixed_ba.size(), which could lead to a runtime error if the number of tokens differs between these tensors.
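The flagged shape bug can be illustrated with a minimal stand-alone sketch (hypothetical function and shapes, not the actual vLLM method): each tensor's new shape must be derived from its *own* `.size()`, not the other tensor's.

```python
import torch

def fix_query_key_value_ordering(
    mixed_qkvz: torch.Tensor, mixed_ba: torch.Tensor, num_heads: int
) -> tuple[torch.Tensor, torch.Tensor]:
    """Sketch of the corrected pattern reviewed above."""
    # Buggy pattern would be:
    #   new_tensor_shape_ba = mixed_qkvz.size()[:-1] + (...)
    # which breaks whenever the two tensors disagree in leading dims.
    new_tensor_shape_qkvz = mixed_qkvz.size()[:-1] + (
        num_heads, mixed_qkvz.size(-1) // num_heads,
    )
    # Correct: use mixed_ba's own leading dimensions.
    new_tensor_shape_ba = mixed_ba.size()[:-1] + (
        num_heads, mixed_ba.size(-1) // num_heads,
    )
    return (
        mixed_qkvz.view(*new_tensor_shape_qkvz),
        mixed_ba.view(*new_tensor_shape_ba),
    )
```

With two heads, a `(5, 8)` qkvz tensor reshapes to `(5, 2, 4)` while a `(5, 4)` ba tensor reshapes to `(5, 2, 2)`; deriving the latter from the former's size would fail the `view`.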
Yes, we have needed this for a long time.

Please test Qwen3.5, Qwen3-Next, and LoRA.

OK, I will add it.

I tested this refactor on the XPU platform; Qwen3.5-9B shows an accuracy issue. Can you check?

I just ran a simple test on an A100 and the output is normal. I don't have an XPU machine for testing. What are your test cases and outputs? Did you run an accuracy test?
jikunshang left a comment

Thanks for refactoring. Just some minor comments from my side.
```python
        use_qk_l2norm_in_kernel=use_qk_l2norm_in_kernel,
    )

    def forward_native(
```
Ideally, forward_native should be a torch-native impl so every platform could leverage it; using Triton here means the CPU platform will throw an error. I am OK to keep this, just a minor concern. cc @bigPYJ1151
I don't think we need a torch-native impl, just as there is no torch-native flash attention in vLLM.
Agree we don't need it here. It's just about naming; maybe we should rename it to forward_triton to avoid confusion.
My understanding is that forward_native in CustomOp should be a torch-native impl. https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/custom_op.py#L138-L144
I agree that forward_native should be a torch-native implementation, so using Triton here is not reasonable. However, CustomOp provides platform-specific forward dispatch ( https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/custom_op.py#L196-L207 ), which forward_triton would not have. I think the best practice is the vLLM IR proposed by Luka ( #32358 ): we can define the Triton kernel as an IR kernel and specify its platform-wide usage.
So we can wait for the IR PR to be merged and then refactor the code. Is that acceptable?
I think we can do this later; it doesn't necessarily have to be done in this PR.
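For context, the CustomOp convention being discussed can be sketched roughly as follows. This is a simplified stand-in, not vLLM's actual class (which dispatches via `current_platform`): `forward` picks a backend per platform, and `forward_native` is the pure-torch fallback that must run everywhere, including CPU, which is why a Triton-backed `forward_native` violates the convention.

```python
import torch

class CustomOpSketch:
    """Simplified sketch of the CustomOp dispatch convention."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hypothetical platform probe; vLLM consults current_platform.
        if torch.cuda.is_available():
            return self.forward_cuda(x)
        return self.forward_native(x)

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Pure torch implementation: runs on any platform, CPU included.
        return torch.nn.functional.silu(x)

    def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
        # Platform-specific path; in practice often a Triton/CUDA kernel.
        return self.forward_native(x)
```

Under this convention, renaming a Triton-only path to something like forward_triton would lose the dispatch hook, which is the trade-off raised above.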
Thanks for checking. Then it is probably a platform-specific issue on our side. Let me take a look further.
This pull request has merge conflicts that must be resolved before it can be merged.
…t and Qwen3.5 Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
Hi @wxsIcey, the pre-commit checks have failed. Please run:

```
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

I triggered …

Thank you. All tests have passed. Is it possible to get approval?

@jikunshang PTAL again.
@claude review |
A few things still need attention before this can merge: there is an unresolved reviewer question on gdn_linear_attn.py, pre-commit was failing as of March 26 (unclear if fixed), and the gdn_in_proj stub calls an undefined _forward_in_proj method (see inline comment).
Extended reasoning...
Overview
This PR extracts GatedDeltaNetAttention from qwen3_next.py into a new shared gdn_linear_attn.py and removes the Qwen3NextGatedDeltaNet/Qwen3_5GatedDeltaNet subclass hierarchy in favour of a single parameterised class. It touches ~1000 lines across three files, primarily a refactor but with meaningful behavioural changes.
Security Risks
No security-sensitive code paths are touched. Risk is limited to model correctness and inference performance.
Level of Scrutiny
The change is architecturally meaningful as the foundation for out-of-tree platform dispatch unifying two previously divergent implementations. LM Eval on B200 passed, but several reviewer concerns remain open. This warrants human approval rather than bot shadow-approval.
Other Factors
- Unresolved reviewer comment - The inline question on gdn_linear_attn.py (comment id 2981910661, "is this necessary?") is not marked resolved in the timeline.
- Pre-commit failure - Mergify reported a pre-commit failure on March 26; it is not confirmed whether this was subsequently fixed.
- gdn_in_proj dead code - The stub function at line 950 calls self._forward_in_proj which does not exist on GatedDeltaNetAttention. While currently unreachable dead code, it is a footgun for future platform-plugin authors (see inline comment for details).
- forward_native naming - ChunkGatedDeltaRule.forward_native actually invokes a Triton/FLA kernel, violating the CustomOp convention. The team agreed to defer this to a follow-up.
```python
    layer_name: str,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Custom op for the input projection.
    """
    forward_context: ForwardContext = get_forward_context()
    self = forward_context.no_compile_layers[layer_name]
    return self._forward_in_proj(hidden_states)


def gdn_attention_core(
    mixed_qkv: torch.Tensor,
```
🟡 The gdn_in_proj function (line 950) calls self._forward_in_proj(hidden_states), but GatedDeltaNetAttention defines no _forward_in_proj method — this would raise AttributeError if invoked. The function is also never registered via direct_register_custom_op (unlike gdn_attention_core), making it unreachable dead code. The _qkvz_output_size helper (with docstring "for gdn_in_proj fake impl") is similarly orphaned and should be removed or completed.
Extended reasoning...
The gdn_in_proj function defined at lines 946–957 of vllm/model_executor/layers/mamba/gdn_linear_attn.py is an incomplete custom-op stub left over from the refactoring. It retrieves the layer from the forward context and then calls self._forward_in_proj(hidden_states), but GatedDeltaNetAttention (and none of its base classes — PluggableLayer, MambaBase) defines any _forward_in_proj method. A grep of the entire codebase confirms only one occurrence of _forward_in_proj: the call site at line 957 itself.
The companion _qkvz_output_size method at line 668 has a docstring that explicitly reads "for gdn_in_proj fake impl", confirming that the author intended this to become a full custom op (analogous to the working gdn_attention_core op), with a real implementation and a fake/shape-only implementation for torch.compile. That work was never completed: direct_register_custom_op is called only for gdn_attention_core, never for gdn_in_proj.
As things stand, gdn_in_proj is unreachable dead code. No code path in the codebase calls it directly, and it is not registered as a torch.ops.vllm.* custom op that could be dispatched to. So there is no runtime failure today.
However, the purpose of PluggableLayer is precisely to allow out-of-tree platform plugins (e.g., XPU, NPU) to register alternative forward implementations. If a plugin author discovers gdn_in_proj and tries to wire it up as a custom op for the input-projection step, the stub will crash with AttributeError: GatedDeltaNetAttention object has no attribute _forward_in_proj the first time it is called. This makes the code a footgun for future contributors and platform integrators.
Step-by-step proof:
- A platform plugin registers `gdn_in_proj` via `direct_register_custom_op`, pointing at `gdn_in_proj`.
- The plugin arranges for the forward pass to call `torch.ops.vllm.gdn_in_proj(hidden_states, qkvz_size, ba_size, self.prefix)`.
- `gdn_in_proj` runs, resolves `self` from the forward context, then executes `self._forward_in_proj(hidden_states)`.
- Python raises `AttributeError: GatedDeltaNetAttention object has no attribute _forward_in_proj`.
Fix: Either (a) implement _forward_in_proj on GatedDeltaNetAttention and register gdn_in_proj with direct_register_custom_op plus a gdn_in_proj_fake shape function, or (b) remove gdn_in_proj and _qkvz_output_size entirely if the custom-op abstraction for input projection is not needed in this PR.
Thanks for the review. I forgot to remove the unnecessary gdn_in_proj when resolving the conflict at #38152. It has been fixed.
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: Icey <1790571317@qq.com>
@claude review |
…t and Qwen3.5 (vllm-project#37975) Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Icey <1790571317@qq.com>
…t and Qwen3.5 (vllm-project#37975) Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…t and Qwen3.5 (vllm-project#37975) Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
Purpose
Move the GDN (Gated Delta Net) layer implementation from qwen3_next.py into a dedicated gdn_linear_attn.py, and unify Qwen3Next and Qwen3.5 under a single GatedDeltaNetAttention class.

Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.