[Bugfix]: Correct handling of cos_sin_cache length#1900
jianzs wants to merge 2 commits into vllm-project:main
Conversation
@whx-sjtu PTAL
Pull Request Overview
This PR fixes performance issues in rotary embedding cos/sin cache handling by correcting variable usage and preventing unnecessary cache recreation. The fix ensures that the cache, which is already initialized with maximum context length, is not unnecessarily recreated during processing.
- Replaces cache recreation logic with an error when `max_seq_len` exceeds the initialized maximum
- Corrects variable assignment in `_set_cos_sin_cache` from `max_seq_len_cached` to `max_seq_len`
- Removes redundant `max_seq_len` assignment during initialization, since the cache setup handles this
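A minimal sketch of the corrected behavior described above, using hypothetical class and method names rather than the actual vllm-ascend implementation: the cos/sin cache is built once for the maximum context length, and a longer request raises an error instead of triggering a rebuild.

```python
import math


class RotaryEmbeddingCache:
    """Illustrative sketch only; not the actual vllm-ascend code."""

    def __init__(self, dim: int, max_seq_len: int, base: float = 10000.0):
        self.max_seq_len = max_seq_len  # fixed maximum context length
        inv_freq = [base ** (-i / dim) for i in range(0, dim, 2)]
        # Build the cos/sin cache once, sized for the full context length.
        self.cos_cache = [[math.cos(t * f) for f in inv_freq]
                          for t in range(max_seq_len)]
        self.sin_cache = [[math.sin(t * f) for f in inv_freq]
                          for t in range(max_seq_len)]

    def get(self, seq_len: int):
        # After the fix: never recreate the cache; error out instead of
        # silently rebuilding when seq_len exceeds the initialized maximum.
        if seq_len > self.max_seq_len:
            raise ValueError(
                f"seq_len {seq_len} exceeds initialized max_seq_len "
                f"{self.max_seq_len}")
        return self.cos_cache[:seq_len], self.sin_cache[:seq_len]
```

Slicing a prebuilt cache keeps the hot path allocation-free, which is the performance point of the fix.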
whx-sjtu left a comment
I reviewed the code of branch v0.9.1-dev and found that this problem has already been solved in that branch, but the fix hasn't been ported to main. Thanks for finding and fixing this. LGTM.
Codecov Report

❌ Your patch check has failed because the patch coverage (40.00%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

```
@@           Coverage Diff           @@
##             main    #1900   +/-   ##
=======================================
  Coverage   76.09%   76.09%
=======================================
  Files         114      114
  Lines       13103    13100    -3
=======================================
- Hits         9971     9969    -2
+ Misses       3132     3131    -1
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@whx-sjtu Which PR fixed this issue in the 0.9.1-dev branch?
```python
def _set_cos_sin_cache(self, seq_len, device, dtype):
    self.max_seq_len_cached = seq_len
    self.max_seq_len = seq_len * self.scaling_factor
```
There is no problem in v0.9.1. What happened?
So can we just cherry-pick this one?
Ok, I've cherry-picked #1551 into the current pr.
force-pushed from 20683f4 to 3eb5b75
@zzzzwwjj PTAL
force-pushed from 2fee7a0 to d07d61a
fix OOM error when `chunked_prefill_for_mla` is enabled in long-input scenarios. Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
force-pushed from d07d61a to e25dee4
@wangxiyuan @Yikun @ganyi1996ppo ready to merge.
```python
    position=positions,
    attn_state=attn_state)
# NOTE: when use ring_mla, attn_mask don't need to generate here.
if not self.use_ring_mla or attn_state == AscendAttentionState.PrefillNoCache:
```
According to the comment, should this be `and` rather than `or`?
When ring_mla is in use and attn_state == PrefillNoCache, it will not use the ring-attn op and still needs attn_mask, so `or` is right.
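A small illustrative sketch of that boolean condition (the enum here is a simplified stand-in, not the real vllm-ascend `AscendAttentionState`), showing why `or` is correct while `and` would wrongly skip mask generation:

```python
from enum import Enum, auto


class AscendAttentionState(Enum):
    # Simplified stand-in for illustration only.
    PrefillNoCache = auto()
    PrefillCacheHit = auto()
    DecodeOnly = auto()


def needs_attn_mask(use_ring_mla: bool,
                    attn_state: AscendAttentionState) -> bool:
    # `or` is intentional: even when ring_mla is enabled, the PrefillNoCache
    # state does not go through the ring-attention op and still needs a mask.
    return (not use_ring_mla
            or attn_state is AscendAttentionState.PrefillNoCache)
```

With `and`, the condition would be false whenever `use_ring_mla` is true, so the ring_mla + PrefillNoCache case would skip the mask it still requires.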
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
This PR addresses the performance issue related to cos/sin cache handling:
The cos/sin cache is already initialized with the maximum context length. However, because `max_seq_len_cached` was stored as `seq_len`, the condition check was incorrect, leading to unnecessary cache recreation. Since the cache is already sized for the maximum context length, it should never trigger recreation during processing.
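A toy reproduction of that incorrect condition check (names are illustrative only, not the actual vllm-ascend code): because the guard compares against the last `seq_len` seen rather than the initialized maximum, each longer request retriggers a costly rebuild even though the cache already covers the full context length.

```python
class CosSinCacheDemo:
    """Toy reproduction of the bug; illustrative names only."""

    def __init__(self, max_position: int):
        self.max_seq_len = max_position  # cache is sized for this at init
        self.max_seq_len_cached = 0      # buggy bound, updated per call
        self.rebuilds = 0

    def forward_buggy(self, seq_len: int) -> None:
        # Bug: compares against the last seq_len seen, not the initialized
        # maximum, so every longer request rebuilds the cache.
        if seq_len > self.max_seq_len_cached:
            self.max_seq_len_cached = seq_len
            self.rebuilds += 1


demo = CosSinCacheDemo(max_position=4096)
for s in (128, 256, 512):
    demo.forward_buggy(s)
print(demo.rebuilds)  # 3 needless rebuilds; the cache already covers 4096
```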
Fixed variable naming: `max_seq_len_cached` was never used and should be `max_seq_len`. This is also the correct variable to check against the maximum context length.

Does this PR introduce any user-facing change?
No
How was this patch tested?
CI pass.