feat(mla): extract KV-cache update #33250
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of it. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
|
Hi @dw2761, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically before each commit.
Code Review
This pull request introduces a clean refactoring in the MLACommonImpl class by extracting the KV-cache update logic into a new _update_kv_cache method. The execution of this new method within the forward pass is controlled by a new boolean class attribute, forward_includes_kv_cache_update, which is set to True by default to maintain existing behavior. This change improves code modularity and prepares for future enhancements as described. The implementation is correct and the logic is preserved. The provided benchmarks confirm that there is no performance regression.
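The review above describes a class-level boolean that gates the in-forward cache write while preserving default behavior. A minimal sketch of that pattern, with illustrative names (this is not vLLM's actual implementation):

```python
class AttentionImpl:
    # Default preserves existing behavior: forward() writes the KV cache.
    forward_includes_kv_cache_update = True

    def __init__(self):
        self.cache_writes = 0

    def _update_kv_cache(self, kv):
        # Stand-in for the extracted cache-write step.
        self.cache_writes += 1

    def forward(self, kv):
        # The update is gated by the class attribute, so subclasses can
        # opt out without changing any call sites.
        if self.forward_includes_kv_cache_update:
            self._update_kv_cache(kv)
        return kv


class DeferredCacheImpl(AttentionImpl):
    # A future variant can flip the flag and schedule the update elsewhere.
    forward_includes_kv_cache_update = False
```

Because the flag is a class attribute rather than a constructor argument, flipping it in a subclass changes behavior for every instance without touching `__init__` signatures, which is what makes the gradual migration described above low-risk.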
Force-pushed from b24e4ee to 9368dd7 (Compare)
Signed-off-by: Di Wu <dw2761@nyu.edu>
Force-pushed from 9368dd7 to ca2c4e7 (Compare)
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Di Wu <95495325+dw2761@users.noreply.github.com>
Rebased on top of #33284. This PR no longer has a meaningful diff vs main, so I'm closing it as superseded.
Purpose
This PR is a small refactor towards #32335.
- **`MLACommonImpl`** (`vllm/model_executor/layers/attention/mla_attention.py`)
  - Added `forward_includes_kv_cache_update = True` on `MLACommonImpl` to preserve current behavior while enabling gradual migration.
  - Implemented a `_update_kv_cache()` method that calls `ops.concat_and_cache_mla(...)` to write the MLA KV-cache (latent + RoPE-related components) using `attn_metadata.slot_mapping`.
  - Updated `MLACommonImpl.forward()` to call the KV-cache update via:

    ```python
    if self.forward_includes_kv_cache_update:
        self._update_kv_cache(...)
    ```
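Conceptually, `concat_and_cache_mla` concatenates each token's latent KV component with its RoPE component and scatters the result into the cache at the positions given by the slot mapping. A pure-Python mock of that behavior (vLLM's real op is a CUDA kernel; shapes and names here are illustrative):

```python
def concat_and_cache_mla_mock(kv_latent, k_rope, kv_cache, slot_mapping):
    """Mock of the MLA cache write.

    kv_latent:    per-token latent vectors, [num_tokens][latent_dim]
    k_rope:       per-token RoPE components, [num_tokens][rope_dim]
    kv_cache:     flat cache, [num_slots][latent_dim + rope_dim], mutated in place
    slot_mapping: cache slot index for each token
    """
    for i, slot in enumerate(slot_mapping):
        # List "+" concatenates, mirroring the concat-then-cache semantics.
        kv_cache[slot] = kv_latent[i] + k_rope[i]


# Toy example: 8 cache slots, latent_dim=3, rope_dim=1, two new tokens.
cache = [[0.0] * 4 for _ in range(8)]
concat_and_cache_mla_mock(
    kv_latent=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    k_rope=[[0.1], [0.2]],
    kv_cache=cache,
    slot_mapping=[5, 2],  # token 0 -> slot 5, token 1 -> slot 2
)
```

The slot mapping is what lets tokens from different requests land in non-contiguous cache blocks, which is why the update step can be cleanly isolated from the attention computation itself.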
Test Plan
--attention-backend FLASH_ATTN_MLA
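The PR only shows the backend flag; a plausible reconstruction of the benchmark invocation, assuming vLLM's `vllm bench latency` subcommand, the dummy-weight load format, and the model named under Test Result:

```shell
# Hypothetical command shape -- only --attention-backend is confirmed by the PR.
vllm bench latency \
  --model deepseek-ai/DeepSeek-V2-Lite-Chat \
  --load-format dummy \
  --attention-backend FLASH_ATTN_MLA
```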
Test Result
Latency (dummy weights)
Model: deepseek-ai/DeepSeek-V2-Lite-Chat