[Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 #7581

wxsIcey wants to merge 5 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

- If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request streamlines the implementation of Gated Delta Net Attention for the Qwen3Next and Qwen3.5 models by consolidating their logic into a single, shared layer. This change improves code maintainability and reduces redundancy by centralizing Ascend-specific optimizations for this attention mechanism, rather than keeping separate patches for each model. The unified approach ensures consistent behavior and easier future updates across these models.
Code Review
This pull request successfully refactors the GatedDeltaNetAttention implementations for Qwen3Next and Qwen3.5 into a single shared layer, AscendGatedDeltaNetAttention. This is a good architectural improvement that enhances code maintainability by reducing duplication. The overall logic for handling both model variations within the new unified class appears correct. I have one suggestion to simplify a complex and repetitive section of the code, which should improve readability.
```python
if spec_sequence_masks is not None and core_attn_out_non_spec is not None:
    merged_out = torch.empty(
        (1, num_actual_tokens, *core_attn_out_spec.shape[2:]),
        dtype=core_attn_out_non_spec.dtype,
        device=core_attn_out_non_spec.device,
    )
    merged_out.index_copy_(1, spec_token_indx, core_attn_out_spec)
    merged_out.index_copy_(1, non_spec_token_indx, core_attn_out_non_spec)
    if not enable_sp():
        core_attn_out[:num_actual_tokens] = merged_out.squeeze(0)
    else:
        core_attn_out[:num_actual_tokens] = merged_out.squeeze(0)[:num_actual_tokens]
elif spec_sequence_masks is not None:
    if not enable_sp():
        core_attn_out[:num_actual_tokens] = core_attn_out_spec.squeeze(0)
    else:
        core_attn_out[:num_actual_tokens] = core_attn_out_spec.squeeze(0)[:num_actual_tokens]
else:
    if not enable_sp():
        core_attn_out[:num_actual_tokens] = core_attn_out_non_spec.squeeze(0)
    else:
        core_attn_out[:num_actual_tokens] = core_attn_out_non_spec.squeeze(0)[:num_actual_tokens]
```
The logic for merging the core attention output is complex and contains repeated if/else blocks for handling enable_sp(). This can be simplified to improve readability and maintainability by first determining the output tensor and then applying the enable_sp() related slicing once.
```python
# 3. Merge core attention output
if spec_sequence_masks is not None and core_attn_out_non_spec is not None:
    final_out = torch.empty(
        (1, num_actual_tokens, *core_attn_out_spec.shape[2:]),
        dtype=core_attn_out_non_spec.dtype,
        device=core_attn_out_non_spec.device,
    )
    final_out.index_copy_(1, spec_token_indx, core_attn_out_spec)
    final_out.index_copy_(1, non_spec_token_indx, core_attn_out_non_spec)
    final_out = final_out.squeeze(0)
elif spec_sequence_masks is not None:
    final_out = core_attn_out_spec.squeeze(0)
else:
    final_out = core_attn_out_non_spec.squeeze(0)

if enable_sp():
    final_out = final_out[:num_actual_tokens]
core_attn_out[:num_actual_tokens] = final_out
```
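For readers unfamiliar with `index_copy_`, the merge step in the suggestion can be illustrated with a small pure-Python analogue. The helper below is hypothetical and exists only to show the scatter-by-index semantics; it is not part of the actual patch.

```python
def merge_outputs(num_tokens, spec_idx, spec_out, non_spec_idx, non_spec_out):
    """Scatter speculative and non-speculative outputs back into token order,
    mimicking the two index_copy_ calls along the token dimension."""
    merged = [None] * num_tokens
    for i, v in zip(spec_idx, spec_out):
        merged[i] = v  # analogue of index_copy_(1, spec_token_indx, ...)
    for i, v in zip(non_spec_idx, non_spec_out):
        merged[i] = v  # analogue of index_copy_(1, non_spec_token_indx, ...)
    return merged

# Tokens 1 and 3 came from the speculative path, tokens 0 and 2 from the
# non-speculative path; the merge restores the original token order.
print(merge_outputs(4, [1, 3], ["s0", "s1"], [0, 2], ["n0", "n1"]))
# → ['n0', 's0', 'n1', 's1']
```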
This pull request has conflicts; please resolve them before we can evaluate the pull request.
…t and Qwen3.5

Signed-off-by: wxsIcey <1790571317@qq.com>

Force-pushed from a744b1d to 0e7d1c8.
This has been cherry-picked by #7962.
What this PR does / why we need it?
Move the GDN (Gated Delta Net) layer implementation from `qwen3_next.py` into a dedicated `gdn_linear_attn.py`, and unify Qwen3Next and Qwen3.5 under a single `GatedDeltaNetAttention` class.

Needs vllm-project/vllm#37975.
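As a rough sketch of the refactor described above, the unified layer could select its per-model configuration at construction time. All names, parameters, and values below are hypothetical illustrations, not the actual vLLM Ascend API.

```python
class GatedDeltaNetAttention:
    """Hypothetical sketch of one shared GDN attention layer covering both
    Qwen3Next and Qwen3.5; the real class lives in gdn_linear_attn.py and
    carries the Ascend-specific attention logic."""

    def __init__(self, model_type: str, head_dim: int = 128):
        if model_type not in ("qwen3_next", "qwen3.5"):
            raise ValueError(f"unsupported model type: {model_type}")
        # Differences between the two model families are reduced to
        # configuration, so a single implementation serves both
        # (assumption for illustration).
        self.model_type = model_type
        self.head_dim = head_dim


def build_gdn_layer(model_type: str) -> GatedDeltaNetAttention:
    # Both model families now route through the same shared class instead
    # of maintaining separate per-model patches.
    return GatedDeltaNetAttention(model_type)
```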
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
CI passed with the existing tests.