[PD][Bugfix] Fix KV Cache sharing with HMA by NickLucche · Pull Request #44629 · vllm-project/vllm

NickLucche · 2026-06-05T07:54:47Z

As of #41847, HMA is the default way to serve models, even when a connector is provided.

Unfortunately, we missed an important case: layers that share the same KV cache tensor. This currently crashes on main when the connector attempts to fetch the corresponding layer_spec.
The fix is straightforward, because layers that do not need their own KV cache are already excluded from kv_cache_spec at setup time:

vllm/vllm/v1/worker/gpu_model_runner.py

Lines 7441 to 7459 in 3da29aa

    
    kv_cache_spec: dict[str, KVCacheSpec] = {}
    layer_type = cast(type[Any], AttentionLayerBase)
    attn_layers = get_layers_from_vllm_config(self.vllm_config, layer_type)
    for layer_name, attn_module in attn_layers.items():
        if isinstance(attn_module, Attention) and (
            kv_tgt_layer := attn_module.kv_sharing_target_layer_name
        ):
            # The layer doesn't need its own KV cache and will use that of
            # the target layer. We skip creating a KVCacheSpec for it, so
            # that KV cache management logic will act as this layer does
            # not exist, and doesn't allocate KV cache for the layer. This
            # enables the memory saving of cross-layer kv sharing, allowing
            # a given amount of memory to accommodate longer context lengths
            # or enable more requests to be processed simultaneously.
            self.shared_kv_cache_layers[layer_name] = kv_tgt_layer
            continue
        # Skip modules that don't need KV cache (eg encoder-only attention)
        if spec := attn_module.get_kv_cache_spec(self.vllm_config):
            kv_cache_spec[layer_name] = spec

All we have to do in the PD connector is skip registration for those layers, like we did before supporting HMA.
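The skip described above can be sketched as follows. This is a minimal illustration rather than the connector code from this PR: the function `layers_to_register`, its arguments, and the layer names are hypothetical stand-ins, and the only thing borrowed from the snippet above is the idea that shared layers appear in `shared_kv_cache_layers` but not in `kv_cache_spec`.

```python
# Hypothetical sketch: NOT the vLLM connector API, just the skip logic in isolation.

def layers_to_register(
    kv_caches: dict[str, object],
    kv_cache_spec: dict[str, object],
    shared_kv_cache_layers: dict[str, str],
) -> dict[str, tuple[object, object]]:
    """Pair each layer that owns a KV cache tensor with its spec.

    Layers listed in shared_kv_cache_layers reuse another layer's tensor and
    have no entry in kv_cache_spec, so they are skipped.
    """
    registered: dict[str, tuple[object, object]] = {}
    for layer_name, tensor in kv_caches.items():
        if layer_name in shared_kv_cache_layers:
            # Shared layer: its target layer is registered instead.
            continue
        # Safe lookup: every non-shared layer here is assumed to have a spec.
        registered[layer_name] = (tensor, kv_cache_spec[layer_name])
    return registered


# Two layers backed by the same tensor; only the first has a spec (assumed names).
target_cache = object()
kv_caches = {"layers.0.attn": target_cache, "layers.1.attn": target_cache}
kv_cache_spec = {"layers.0.attn": "full_attention_spec"}
shared = {"layers.1.attn": "layers.0.attn"}

registered = layers_to_register(kv_caches, kv_cache_spec, shared)
```

Without the `continue`, the lookup `kv_cache_spec["layers.1.attn"]` would raise a `KeyError`, which is analogous to the crash described in this PR.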

Tested with a google/gemma-4-E2B-it PD deployment

local-chat-completions ({'model': 'google/gemma-4-E2B-it', 'base_url': 'http://127.0.0.1:25068/v1/chat/completions', 'tokenizer_backend': 'huggingface', 'max_concurrency': 100}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6573|±  |0.0131|
|     |       |strict-match    |     5|exact_match|↑  |0.4860|±  |0.0138|

A test case was also added to our eval suite.

Signed-off-by: NickLucche <nlucches@redhat.com>

mergify · 2026-06-05T07:55:55Z

Hi @NickLucche, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

NickLucche added 3 commits June 4, 2026 15:55

init

0ee1a3f

Signed-off-by: NickLucche <nlucches@redhat.com>

tests

7047cd1

Signed-off-by: NickLucche <nlucches@redhat.com>

accuracy

328e140

Signed-off-by: NickLucche <nlucches@redhat.com>

NickLucche requested review from ApostaC, orozery and xuechendi as code owners June 5, 2026 07:54

claude Bot reviewed Jun 5, 2026

mergify Bot added the v1, bug, and kv-connector labels Jun 5, 2026

NickLucche wants to merge 3 commits into vllm-project:main from NickLucche:fix-pd-kv-sharing
