[Model] Add support for openPangu moe model #28775
vllm-bot merged 26 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--28775.org.readthedocs.build/en/28775/
Code Review
This pull request adds support for the openPangu_Pro_Moe_v2 model, which introduces a new attention mechanism with sink KV caches. The changes are extensive, touching the model definition, attention layers, and KV cache management. I've identified a couple of critical bugs in the implementation concerning tensor initialization and block table manipulation in the new attention backend. Additionally, I've suggested a refactoring to improve code maintainability by reducing duplication. Addressing these points will enhance the correctness and robustness of the new model support.
This pull request has merge conflicts that must be resolved before it can be merged.
@LucasWilkinson @zou3519 @mgoin can you review the attention implementation?
@LucasWilkinson Hello, I notice that there are two failing checks in CI, but the failing code seems unrelated to our modification. What can I do about this?
Retrying the failed test
Hello, I tried to run the failing case in my local environment and it seems to work well, and the failing code appears unrelated to our modification, so I wonder if there are any hidden bugs. My test results are as follows:
Hi @yt0428, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Head branch was pushed to by a user without write access
@DarkLight1337 Hello, do you have any suggestions on how to deal with these failing CI checks?
Force merging
@DarkLight1337 I am late again here. I think that there was a slight oversight.
@yt0428 Can you please help put up a PR that fixes those failures?
EDIT: I just pushed a PR that hopefully fixes the issue without breaking any models. Let me know your thoughts :)
Hello, I have read your PR and I think the fix is reasonable. Thanks for your fix and efforts!
Signed-off-by: yuantao <2422264527@qq.com> Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
…im] (vllm-project#32274) Summary: The breakage was introduced in D89937241 (vllm-project#28775) and D90045073 (vllm-project#31596). We see reshaping errors with the return values of the attention layer: when the query shape is 4D, [batch_size, num_tokens, num_heads, head_dim], the output shape is composed as [batch_size, num_heads * head_dim], but the correct shape should be [batch_size, num_tokens, num_heads * head_dim]. Test Plan: Patched this diff and tested vLLM local services; it worked with no issue. Reviewed By: frank-wei Differential Revision: D90600898
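To make the shape bug above concrete, here is a minimal pure-Python sketch (illustrative names only, not the actual vLLM code): dropping the num_tokens dimension changes the total element count, so the buggy reshape cannot succeed on a 4-D query.

```python
# Illustrative shapes; the real code reshapes attention output tensors.
batch_size, num_tokens, num_heads, head_dim = 2, 5, 8, 64

query_shape = (batch_size, num_tokens, num_heads, head_dim)


def numel(shape):
    """Total number of elements for a given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


# Buggy output shape: num_tokens is dropped, so the element count
# no longer matches the 4-D query.
buggy_shape = (batch_size, num_heads * head_dim)

# Correct output shape: keep batch and token dims, merge only the
# per-head dimensions.
correct_shape = (batch_size, num_tokens, num_heads * head_dim)

print(numel(query_shape) == numel(correct_shape))  # True
print(numel(query_shape) == numel(buggy_shape))    # False
```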
Purpose
This PR adds support for the openPangu moe model, which is characterized by its different KV head size and sink KV in attention.
The model has two features that are not yet supported:
- Different KV head sizes. Although the flash_attn kernel can handle different key and value head sizes, the current vLLM framework does not expose this as an option. In the implemented FlashSinkAttentionBackend, the kv_cache_shape is defined as [num_blocks, block_size, num_kv_heads, head_size_k + head_size_v]. The corresponding KV cache update function reshape_and_cache_kernel_flash_diffkv is implemented in Triton.
- sink_key and sink_value in attention. The attention module of the model receives two extra arguments, sink_key and sink_value, which are learned during training and shared by all inputs. In this initial implementation, I store them in the first blocks of the block pool and remove those blocks from the free blocks so they cannot be scheduled, avoiding overwrites. During the forward pass of FlashSinkAttentionBackend, the block ids of sink_key and sink_value are concatenated onto the normal block_table so attention is computed correctly.
11.26 Update:
We refactored the code to separate the two features above, specifically:
- Renamed FlashSinkAttentionBackend to FlashDiffkvAttentionBackend, which is modified from FlashAttentionBackend to support different head sizes for key and value.
- Moved the sink_key-related logic to GPUModelRunner, where we primarily do two things. First, we store sink_key and sink_value into the kv_caches during KV cache initialization; this is implemented by adding a function prepare_sink_kv_cache called in initialize_kv_cache. Second, we modify the blk_table_tensor and seq_lens in-place in _build_attention_metadata, so that attention backends know there are sink_key and sink_value entries in the kv_caches.
The current implementation is functionally correct but not very elegant. Suggestions from those familiar with these parts of vLLM are much appreciated. Many thanks!!
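The two pieces above can be sketched in plain Python. This is a hedged illustration, not the actual implementation: the real code operates on tensors inside GPUModelRunner, and diffkv_cache_shape, add_sink_blocks, sink_block_ids, blk_table, and seq_lens here are hypothetical stand-in names.

```python
def diffkv_cache_shape(num_blocks, block_size, num_kv_heads,
                       head_size_k, head_size_v):
    """KV cache shape for differing key/value head sizes: key and value
    are packed along the last dimension (head_size_k + head_size_v)."""
    return (num_blocks, block_size, num_kv_heads, head_size_k + head_size_v)


def add_sink_blocks(blk_table, seq_lens, sink_block_ids, sink_len):
    """Prepend the reserved sink-KV blocks to every request's block table
    and extend each sequence length, so attention backends transparently
    attend over the shared sink_key / sink_value entries."""
    new_table = [list(sink_block_ids) + list(row) for row in blk_table]
    new_seq_lens = [s + sink_len for s in seq_lens]
    return new_table, new_seq_lens


# Example: two requests; one sink block (id 0) holding 16 shared sink tokens.
table, lens = add_sink_blocks([[3, 4], [5]], [20, 9], [0], 16)
print(table)  # [[0, 3, 4], [0, 5]]
print(lens)   # [36, 25]

# Packed cache shape for head_size_k=128, head_size_v=64.
print(diffkv_cache_shape(1024, 16, 8, 128, 64))  # (1024, 16, 8, 192)
```

The real code modifies the block table tensor in-place rather than rebuilding lists, but the effect is the same: every request's attention window is widened to include the reserved sink blocks.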
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.