fix(compilation): optimize kv cache update for faster cold start compilation#36007
fourierr wants to merge 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Hi @fourierr, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run these checks automatically.
Code Review
This pull request introduces an optimization to reduce cold start compilation time by allowing graph reuse for KV cache updates, similar to an existing optimization for MoE layers. The approach involves using a forward context to dynamically provide layer names at runtime instead of hard-coding them in the compiled graph.
The implementation for standard Attention layers is correct. However, the optimization is incomplete for MLAAttention layers. While unified_mla_kv_cache_update is no longer excluded from compilation, the necessary changes to mla_attention.py to avoid graph recompilation are missing. This is a critical issue that needs to be addressed to realize the full benefit of this PR.
Note: Security Review is unavailable for this PR.
I am having trouble creating individual review comments, so my feedback is consolidated below.
vllm/config/compilation.py (1008-1010)
This change removes unified_mla_kv_cache_update from splitting_ops, which means it will now be compiled into the graph. However, the necessary changes to vllm/model_executor/layers/attention/mla_attention.py to support this optimization seem to be missing.
Without these changes, unified_mla_kv_cache_update will still receive a hard-coded layer_name string, causing graph recompilation for each MLAAttention layer and negating the performance benefit of this PR for models using MLA.
To complete this optimization, you should apply a similar pattern as you did for Attention layers:
In vllm/model_executor/layers/attention/mla_attention.py:
- Add the `_get_layer_name_for_kv_update()` method to the `MLAAttention` class. Its implementation can be the same as in the `Attention` class.
- In `MLAAttention.forward()`, call `self._get_layer_name_for_kv_update()` and pass its result to `unified_mla_kv_cache_update` instead of `self.layer_name`.
- Update `unified_mla_kv_cache_update` to handle the `"from_forward_context"` magic string by retrieving the actual layer name from the forward context, similar to how `unified_kv_cache_update` is modified in this PR.
This will ensure that MLAAttention layers also benefit from the cold start compilation optimization.
…ilation

This fix addresses issue vllm-project#33267 by applying the same approach used in fast_moe_cold_start to unified_kv_cache_update.

Problem:
- unified_kv_cache_update appears in piecewise cudagraph regions.
- Each layer has a different name, so each of these has to be compiled separately.
- This increases cold start compilation time with Dynamo partition because the graphs can no longer be reused.

Solution:
- Add an all_kv_cache_layers list and a kv_cache_layer_index counter to ForwardContext.
- When fast_moe_cold_start is enabled, use 'from_forward_context' as the layer_name.
- At runtime, unified_kv_cache_update retrieves the actual layer name from the forward context instead of using a hard-coded string.
- This allows torch.compile to better reuse compiled graphs across layers.

Changes:
- vllm/forward_context.py: Added all_kv_cache_layers and kv_cache_layer_index fields.
- vllm/model_executor/layers/attention/attention.py: Added _get_layer_name_for_kv_update() method and modified unified_kv_cache_update to handle 'from_forward_context'.
- vllm/config/compilation.py: Removed the temporary workaround that added unified_kv_cache_update to splitting_ops.
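The counter lifecycle described above can be sketched in isolation. This is an assumed illustration of the mechanism (the names come from the commit message; the real vLLM implementation may differ in details such as where the counter is reset):

```python
# Illustrative sketch: the forward context lists every KV-cache layer in
# order, and each unified-update call consumes the next name via a counter.
class ForwardContext:
    def __init__(self, layer_names):
        self.all_kv_cache_layers = list(layer_names)
        self.kv_cache_layer_index = 0


def next_kv_cache_layer(ctx: ForwardContext) -> str:
    # Runtime lookup replaces the hard-coded per-layer string, so the
    # compiled graph is identical for every layer.
    name = ctx.all_kv_cache_layers[ctx.kv_cache_layer_index]
    ctx.kv_cache_layer_index += 1
    return name


ctx = ForwardContext(["layers.0.attn", "layers.1.attn"])
names = [next_kv_cache_layer(ctx) for _ in range(2)]
assert names == ["layers.0.attn", "layers.1.attn"]

# A fresh context (or a reset counter) is needed at the start of each
# forward pass so layers consume names in order again.
ctx.kv_cache_layer_index = 0
assert next_kv_cache_layer(ctx) == "layers.0.attn"
```

The key invariant is that layers execute in the same order the context lists them, so the counter yields the correct name without any per-layer constant in the graph.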
Force-pushed from 3728029 to 1bb3dde
Documentation preview: https://vllm--36007.org.readthedocs.build/en/36007/
Description:
This fix addresses issue #33267 by applying the same approach used in fast_moe_cold_start to unified_kv_cache_update.
Problem:
- unified_kv_cache_update appears in piecewise cudagraph regions.
- Each layer has a different name, so each instance has to be compiled separately.
- This increases cold start compilation time with Dynamo partition because the graphs can no longer be reused.

Solution:
- Add an all_kv_cache_layers list and a kv_cache_layer_index counter to ForwardContext.
- When fast_moe_cold_start is enabled, use 'from_forward_context' as the layer_name.
- At runtime, unified_kv_cache_update retrieves the actual layer name from the forward context instead of using a hard-coded string.
- This allows torch.compile to better reuse compiled graphs across layers.

Changes:
- vllm/forward_context.py: Added all_kv_cache_layers and kv_cache_layer_index fields.
- vllm/model_executor/layers/attention/attention.py: Added _get_layer_name_for_kv_update() and modified unified_kv_cache_update to handle 'from_forward_context'.
- vllm/config/compilation.py: Removed the temporary workaround that added unified_kv_cache_update to splitting_ops.