
fix(compilation): optimize kv cache update for faster cold start compilation#36007

Open
fourierr wants to merge 1 commit into vllm-project:main from fourierr:fix/issue-33267-kv-cache-compilation

Conversation

@fourierr

@fourierr fourierr commented Mar 4, 2026

Description:

This fix addresses issue #33267 by applying the same approach used in fast_moe_cold_start to unified_kv_cache_update.

Problem:

  • unified_kv_cache_update appears in piecewise cudagraph regions
  • Each layer has a different name, so each one has to be compiled separately
  • This increases cold start compilation time with Dynamo partition because the graphs can no longer be reused

Solution:

  • Add all_kv_cache_layers list and kv_cache_layer_index counter to ForwardContext
  • When fast_moe_cold_start is enabled, use 'from_forward_context' as the layer_name
  • At runtime, unified_kv_cache_update retrieves the actual layer name from the forward context instead of using a hard-coded string
  • This allows torch.compile to better reuse compiled graphs across layers
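The mechanism above can be sketched as follows. This is a simplified, hypothetical illustration of the idea described in the Solution bullets, not the actual vLLM code: the field names mirror the PR description, while `resolve_layer_name` and the wrap-around counter handling are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ForwardContext:
    # Names of all KV-cache layers, in forward-pass order (from the PR text).
    all_kv_cache_layers: list[str] = field(default_factory=list)
    # Counter advanced each time a placeholder name is resolved at runtime.
    kv_cache_layer_index: int = 0


def resolve_layer_name(ctx: ForwardContext, layer_name: str) -> str:
    """Resolve the 'from_forward_context' placeholder to a real layer name."""
    if layer_name != "from_forward_context":
        # Legacy path: a hard-coded name bakes one graph per layer.
        return layer_name
    # Every layer passes the same placeholder string into the compiled
    # region, so torch.compile sees identical graphs; the real name is
    # recovered here, outside the compiled code.
    name = ctx.all_kv_cache_layers[ctx.kv_cache_layer_index]
    # Wrap around so the context can be reused across forward passes
    # (an assumption; the real bookkeeping may reset elsewhere).
    ctx.kv_cache_layer_index = (
        ctx.kv_cache_layer_index + 1) % len(ctx.all_kv_cache_layers)
    return name
```

Because the placeholder is a constant shared by all layers, the compiled graph no longer specializes on the layer name, which is what allows reuse.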

Changes:

  • vllm/forward_context.py: Added all_kv_cache_layers and kv_cache_layer_index fields
  • vllm/model_executor/layers/attention/attention.py: Added _get_layer_name_for_kv_update() method and modified unified_kv_cache_update to handle 'from_forward_context'
  • vllm/config/compilation.py: Removed the temporary workaround that added unified_kv_cache_update to splitting_ops
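The per-layer side of the change, the `_get_layer_name_for_kv_update()` method added to the Attention layer, could look roughly like this. A hedged sketch only: the class is a stand-in, and the `fast_cold_start` flag here is an assumed stand-in for the `fast_moe_cold_start` config option mentioned above.

```python
class Attention:
    """Minimal stand-in for the attention layer; real vLLM code differs."""

    def __init__(self, layer_name: str, fast_cold_start: bool) -> None:
        self.layer_name = layer_name
        self.fast_cold_start = fast_cold_start

    def _get_layer_name_for_kv_update(self) -> str:
        # With the optimization enabled, every layer hands the same constant
        # string to unified_kv_cache_update, so one compiled graph serves
        # all layers; the real name is looked up from the forward context
        # at runtime.
        if self.fast_cold_start:
            return "from_forward_context"
        return self.layer_name
```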

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions

github-actions bot commented Mar 4, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@fourierr fourierr changed the title fix(compilation): optimize kv cache update for faster cold start comp… fix(compilation): optimize kv cache update for faster cold start compilation Mar 4, 2026
@mergify
Contributor

mergify bot commented Mar 4, 2026

Hi @fourierr, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an optimization to reduce cold start compilation time by allowing graph reuse for KV cache updates, similar to an existing optimization for MoE layers. The approach involves using a forward context to dynamically provide layer names at runtime instead of hard-coding them in the compiled graph.

The implementation for standard Attention layers is correct. However, the optimization is incomplete for MLAAttention layers. While unified_mla_kv_cache_update is no longer excluded from compilation, the necessary changes to mla_attention.py to avoid graph recompilation are missing. This is a critical issue that needs to be addressed to realize the full benefit of this PR.

Note: Security Review is unavailable for this PR.


vllm/config/compilation.py (1008-1010)

critical

This change removes unified_mla_kv_cache_update from splitting_ops, which means it will now be compiled into the graph. However, the necessary changes to vllm/model_executor/layers/attention/mla_attention.py to support this optimization seem to be missing.

Without these changes, unified_mla_kv_cache_update will still receive a hard-coded layer_name string, causing graph recompilation for each MLAAttention layer and negating the performance benefit of this PR for models using MLA.

To complete this optimization, you should apply a similar pattern as you did for Attention layers:

  1. In vllm/model_executor/layers/attention/mla_attention.py:
    • Add the _get_layer_name_for_kv_update() method to the MLAAttention class. Its implementation can be the same as in the Attention class.
    • In MLAAttention.forward(), call self._get_layer_name_for_kv_update() and pass its result to unified_mla_kv_cache_update instead of self.layer_name.
    • Update unified_mla_kv_cache_update to handle the "from_forward_context" magic string by retrieving the actual layer name from the forward context, similar to how unified_kv_cache_update is modified in this PR.

This will ensure that MLAAttention layers also benefit from the cold start compilation optimization.
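The reviewer's suggestion could be sketched by factoring the helper into a shared base so both layer types use identical logic. This is an illustrative sketch, not the actual vLLM class hierarchy: the mixin name, constructor signatures, and the `fast_cold_start` flag are all assumed.

```python
class KVUpdateLayerNameMixin:
    """Hypothetical shared helper so Attention and MLAAttention resolve
    KV-update layer names identically."""

    layer_name: str
    fast_cold_start: bool

    def _get_layer_name_for_kv_update(self) -> str:
        # Same behavior the PR adds to Attention: emit the placeholder
        # when the cold-start optimization is on, the real name otherwise.
        if self.fast_cold_start:
            return "from_forward_context"
        return self.layer_name


class MLAAttention(KVUpdateLayerNameMixin):
    def __init__(self, layer_name: str, fast_cold_start: bool) -> None:
        self.layer_name = layer_name
        self.fast_cold_start = fast_cold_start

    # forward() would then pass self._get_layer_name_for_kv_update()
    # to unified_mla_kv_cache_update instead of self.layer_name, and
    # unified_mla_kv_cache_update would resolve the placeholder from
    # the forward context, mirroring unified_kv_cache_update.
```

Sharing the helper keeps the two attention paths from drifting apart as the optimization evolves.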

@fourierr fourierr force-pushed the fix/issue-33267-kv-cache-compilation branch from 3728029 to 1bb3dde Compare March 4, 2026 13:09
@mergify
Contributor

mergify bot commented Mar 4, 2026

Documentation preview: https://vllm--36007.org.readthedocs.build/en/36007/

@mergify mergify bot added the documentation Improvements or additions to documentation label Mar 4, 2026


Labels

documentation Improvements or additions to documentation
