[Bugfix] Reduce _npu_flash_attention mask to 128x128 for memory savings #1100
ApsarasX wants to merge 1 commit into vllm-project:main
Conversation
### What this PR does / why we need it?
Single machine, 16 cards, DeepSeek-R1: attention (tp8/dp2) / MoE (ETP). Best performance relies on:
- vllm-ascend commit id: da9acfca6053352730fce75fb772e214755d0341
- vllm commit id: b124e1085b1bf977e3dac96d99ffd9d8ddfdb6cc
- vllm-project#910
- vllm-project#1100 [Reduce _npu_flash_attention mask to 128x128 for memory savings]
- [Reduce memory usage by splitting tokens in fused_experts]

Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Is this PR still needed? If so, please rebase, otherwise please close it. Thanks
Signed-off-by: ApsarasX <apsarax@outlook.com>
ApsarasX force-pushed from d951107 to df4cee4
This PR is still necessary once `torch_npu._npu_flash_attention` supports the compressed 128x128 mask. I have already rebased this PR.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #1100 +/- ##
=======================================
Coverage 76.31% 76.31%
=======================================
Files 116 116
Lines 13238 13238
=======================================
Hits 10102 10102
Misses 3136 3136
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?
Avoid generating a special `attn_mask` (which could theoretically be as large as a bf16 `[max_model_len, max_model_len]` tensor) when `chunked-prefill` is not enabled and the user input is extremely long. However, this PR depends on modifications to the `torch_npu._npu_flash_attention` operator.
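To put the memory saving in perspective, here is a minimal standalone sketch (not the vllm-ascend implementation; the `causal_mask` / `mask_bytes` helpers and the `max_model_len = 32768` value are illustrative assumptions) comparing a full bf16 `[max_model_len, max_model_len]` mask with a fixed 128x128 tile:

```python
import torch

def causal_mask(size: int, dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # Upper-triangular mask: nonzero entries mark positions to be masked out.
    return torch.triu(torch.ones(size, size, dtype=dtype), diagonal=1)

def mask_bytes(size: int, dtype: torch.dtype = torch.bfloat16) -> int:
    # Memory footprint of a square [size, size] mask in the given dtype.
    return size * size * torch.finfo(dtype).bits // 8

if __name__ == "__main__":
    max_model_len = 32768  # illustrative value only
    print(f"full bf16 [{max_model_len}, {max_model_len}] mask: "
          f"{mask_bytes(max_model_len) / 2**30:.2f} GiB")  # ~2.00 GiB
    print(f"compressed bf16 [128, 128] mask: "
          f"{mask_bytes(128) / 2**10:.2f} KiB")  # 32.00 KiB
    tile = causal_mask(128)  # small fixed tile the kernel would expand internally
```

With a 32K `max_model_len`, the full mask alone costs on the order of 2 GiB of device memory, whereas the 128x128 tile stays at a constant ~32 KiB regardless of sequence length.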
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No