fix(compilation): optimize kv cache update for faster cold start compilation#36007
fourierr wants to merge 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Hi @fourierr, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run these checks automatically.
Code Review
This pull request introduces an optimization to reduce cold start compilation time by allowing graph reuse for KV cache updates, similar to an existing optimization for MoE layers. The approach involves using a forward context to dynamically provide layer names at runtime instead of hard-coding them in the compiled graph.
The implementation for standard Attention layers is correct. However, the optimization is incomplete for MLAAttention layers. While unified_mla_kv_cache_update is no longer excluded from compilation, the necessary changes to mla_attention.py to avoid graph recompilation are missing. This is a critical issue that needs to be addressed to realize the full benefit of this PR.
Note: Security Review is unavailable for this PR.
I am having trouble creating individual review comments, so my feedback is consolidated below.
vllm/config/compilation.py (1008-1010)
This change removes unified_mla_kv_cache_update from splitting_ops, which means it will now be compiled into the graph. However, the necessary changes to vllm/model_executor/layers/attention/mla_attention.py to support this optimization seem to be missing.
Without these changes, unified_mla_kv_cache_update will still receive a hard-coded layer_name string, causing graph recompilation for each MLAAttention layer and negating the performance benefit of this PR for models using MLA.
To complete this optimization, you should apply a similar pattern as you did for Attention layers:
In vllm/model_executor/layers/attention/mla_attention.py:
- Add the `_get_layer_name_for_kv_update()` method to the `MLAAttention` class. Its implementation can be the same as in the `Attention` class.
- In `MLAAttention.forward()`, call `self._get_layer_name_for_kv_update()` and pass its result to `unified_mla_kv_cache_update` instead of `self.layer_name`.
- Update `unified_mla_kv_cache_update` to handle the `"from_forward_context"` magic string by retrieving the actual layer name from the forward context, similar to how `unified_kv_cache_update` is modified in this PR.
This will ensure that MLAAttention layers also benefit from the cold start compilation optimization.
…ilation

This fix addresses issue vllm-project#33267 by applying the same approach used in fast_moe_cold_start to unified_kv_cache_update.

Problem:
- unified_kv_cache_update appears in piecewise cudagraph regions.
- Each layer has a different name, so each of these has to be compiled separately.
- This increases cold start compilation time with Dynamo partition because the graphs can no longer be reused.

Solution:
- Add an all_kv_cache_layers list and a kv_cache_layer_index counter to ForwardContext.
- When fast_moe_cold_start is enabled, use 'from_forward_context' as the layer_name.
- At runtime, unified_kv_cache_update retrieves the actual layer name from the forward context instead of using a hard-coded string.
- This allows torch.compile to better reuse compiled graphs across layers.

Changes:
- vllm/forward_context.py: Added all_kv_cache_layers and kv_cache_layer_index fields.
- vllm/model_executor/layers/attention/attention.py: Added _get_layer_name_for_kv_update() method and modified unified_kv_cache_update to handle 'from_forward_context'.
- vllm/config/compilation.py: Removed the temporary workaround that added unified_kv_cache_update to splitting_ops.
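The counter lifecycle described above can be sketched in isolation. This is an assumed illustration of the mechanism (the names come from the commit message; the real vLLM implementation may differ in details such as where the counter is reset):

```python
# Illustrative sketch: the forward context lists every KV-cache layer in
# order, and each unified-update call consumes the next name via a counter.
class ForwardContext:
    def __init__(self, layer_names):
        self.all_kv_cache_layers = list(layer_names)
        self.kv_cache_layer_index = 0


def next_kv_cache_layer(ctx: ForwardContext) -> str:
    # Runtime lookup replaces the hard-coded per-layer string, so the
    # compiled graph is identical for every layer.
    name = ctx.all_kv_cache_layers[ctx.kv_cache_layer_index]
    ctx.kv_cache_layer_index += 1
    return name


ctx = ForwardContext(["layers.0.attn", "layers.1.attn"])
names = [next_kv_cache_layer(ctx) for _ in range(2)]
assert names == ["layers.0.attn", "layers.1.attn"]

# A fresh context (or a reset counter) is needed at the start of each
# forward pass so layers consume names in order again.
ctx.kv_cache_layer_index = 0
assert next_kv_cache_layer(ctx) == "layers.0.attn"
```

The key invariant is that layers execute in the same order the context lists them, so the counter yields the correct name without any per-layer constant in the graph.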
Force-pushed from 3728029 to 1bb3dde
Documentation preview: https://vllm--36007.org.readthedocs.build/en/36007/
Description:
This fix addresses issue #33267 by applying the same approach used in fast_moe_cold_start to unified_kv_cache_update.
Problem:
- unified_kv_cache_update appears in piecewise cudagraph regions.
- Each layer has a different name, so each instance has to be compiled separately.
- This increases cold start compilation time with Dynamo partition because the graphs can no longer be reused.

Solution:
- Add an all_kv_cache_layers list and a kv_cache_layer_index counter to ForwardContext.
- When fast_moe_cold_start is enabled, use 'from_forward_context' as the layer_name.
- At runtime, unified_kv_cache_update retrieves the actual layer name from the forward context instead of using a hard-coded string.
- This allows torch.compile to better reuse compiled graphs across layers.

Changes:
- vllm/forward_context.py: Added all_kv_cache_layers and kv_cache_layer_index fields.
- vllm/model_executor/layers/attention/attention.py: Added _get_layer_name_for_kv_update() and modified unified_kv_cache_update to handle 'from_forward_context'.
- vllm/config/compilation.py: Removed the temporary workaround that added unified_kv_cache_update to splitting_ops.