feat(mla): add default do_kv_cache_update for MLA #33658
dw2761 wants to merge 13 commits into vllm-project:main from dw2761:feat/mla-kv-cache-update-v2
Conversation
Signed-off-by: Di Wu <dw2761@nyu.edu>
Code Review
This pull request refactors the MLA KV-cache update logic by moving it into a default do_kv_cache_update method on MLAAttentionImpl. This is a good simplification of the MLAAttention layer. However, there is a critical issue where this new method is not implemented for SparseMLAAttentionImpl, which will cause a runtime error when using a sparse MLA backend. My review includes a comment detailing this issue and how to resolve it.
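The pattern under discussion can be sketched in plain Python. The class and method names below mirror the PR, but the bodies are illustrative placeholders, not vLLM's actual implementation: a base impl provides a default `do_kv_cache_update`, and a subclass with an incompatible cache layout (like a sparse impl) must override it rather than inherit a default that assumes the dense layout.

```python
class MLAAttentionImplSketch:
    """Illustrative stand-in for MLAAttentionImpl (not vLLM's real class)."""

    def do_kv_cache_update(self, kv_cache, slot, latent, rope):
        # Default: write the compressed latent and rope portions into the
        # cache entry for this token's slot (dense layout).
        kv_cache[slot] = (latent, rope)


class SparseMLAAttentionImplSketch(MLAAttentionImplSketch):
    """A sparse impl whose cache layout differs from the dense default."""

    def do_kv_cache_update(self, kv_cache, slot, latent, rope):
        # Without this override, the inherited default would assume the
        # dense layout and break at runtime -- the issue flagged in review.
        kv_cache.setdefault("sparse", {})[slot] = (latent, rope)


dense_cache = {}
MLAAttentionImplSketch().do_kv_cache_update(dense_cache, 0, "latent0", "rope0")

sparse_cache = {}
SparseMLAAttentionImplSketch().do_kv_cache_update(sparse_cache, 0, "latent0", "rope0")
print(dense_cache)   # {0: ('latent0', 'rope0')}
print(sparse_cache)  # {'sparse': {0: ('latent0', 'rope0')}}
```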
Signed-off-by: Di Wu <dw2761@nyu.edu>
Signed-off-by: Di Wu <dw2761@nyu.edu>
What is the status of this PR? @dw2761 @ProExpertProg
Signed-off-by: Di Wu <dw2761@nyu.edu>
…feat/mla-kv-cache-update-v2
I just updated the branch with the latest main and pushed a fix for a circular-import issue that was breaking CI. The checks are still running right now. If everything turns green, I'll request review and hopefully get this merged ASAP.
@dw2761 Great, thanks a lot!
Signed-off-by: Di Wu <dw2761@nyu.edu>
…feat/mla-kv-cache-update-v2
attn_metadata = attn_metadata[self.layer_name]
self_kv_cache = self.kv_cache[forward_context.virtual_engine]
# Write the latent and rope to kv cache
This is missing from the indirect call path below. A new custom op needs to be created (like unified_kv_cache_update for GQA; see attention.py), and that op should then be called before the MLA attention op.
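The suggestion can be illustrated with a simplified sketch. Everything here other than the borrowed name of the cache-update op is hypothetical (this is not vLLM's op registry): both the direct path and the indirect, name-based path go through a single registered update op before attention runs, so the indirect path cannot skip the cache write.

```python
# Simplified sketch of the review suggestion: route every call path through
# one registered KV-cache-update op so the indirect path cannot miss it.
# All registry and path names are illustrative, not vLLM's actual APIs.

CUSTOM_OPS = {}

def register_op(name):
    def wrap(fn):
        CUSTOM_OPS[name] = fn
        return fn
    return wrap

@register_op("mla_kv_cache_update")
def mla_kv_cache_update(kv_cache, slot, latent, rope):
    # Write the latent and rope to kv cache (mirrors the annotated diff line).
    kv_cache[slot] = (latent, rope)

def mla_attention(kv_cache, slot):
    # Attention reads whatever the update op wrote for this slot.
    return kv_cache[slot]

def direct_path(kv_cache, slot, latent, rope):
    # Direct path: call the update op, then attention.
    mla_kv_cache_update(kv_cache, slot, latent, rope)
    return mla_attention(kv_cache, slot)

def indirect_path(kv_cache, slot, latent, rope):
    # Indirect (e.g. compiled) path: ops are looked up by name, so it hits
    # the same update op instead of silently skipping the cache write.
    CUSTOM_OPS["mla_kv_cache_update"](kv_cache, slot, latent, rope)
    return mla_attention(kv_cache, slot)

cache = {}
print(indirect_path(cache, 0, "latent", "rope"))  # ('latent', 'rope')
```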
Signed-off-by: Di Wu <dw2761@nyu.edu>
Sure! I'd be happy to be added as a co-author of #34627. I left a review comment in #34627. Please check it!
Hi @ElizaWszola! I checked the commit, but it seems I'm not listed as a formal co-author yet. Could you please amend the commit message to include the standard trailer at the end? It should look like this: And then I'll close this PR. Thanks!
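For reference, the standard trailer GitHub recognizes for co-authorship is a `Co-authored-by:` line at the end of the commit message; using the identity from this PR's sign-offs, the requested trailer would presumably be:

```
Co-authored-by: Di Wu <dw2761@nyu.edu>
```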
I amended a couple of commits listing you as author/co-author. To my knowledge, this should list you as a co-author of the PR when it's merged into main, but feel free to correct me if I'm wrong.
I checked the commit but it seems I'm not listed as a formal co-author yet. Could you please amend the commit message to include the standard trailer at the end?
Looks good! thx!
Closing in favor of #34627 |
Purpose
This PR is part of #32335
It extracts the MLA KV-cache update op from the MLA attention layer into a default MLAAttentionImpl.do_kv_cache_update implementation.
Test Plan
Run the v1 latency benchmark with dummy weights on both main and this PR branch, explicitly selecting the MLA backend with --attention-backend FLASH_ATTN_MLA.
Test Result
Model: deepseek-ai/DeepSeek-V2-Lite-Chat | Backend: FLASH_ATTN_MLA | Weights: dummy | TP=1
Latency