[PD][Core] Fix Mamba prefix cache with PD by NickLucche · Pull Request #42547 · vllm-project/vllm

NickLucche · 2026-05-13T17:10:27Z

Fix 0% prefix cache hit rate with Mamba in PD disaggregation (all/align).
Based on #42554; the real diff is here: NickLucche/vllm@mamba-prefix-caching-pd...NickLucche:vllm:pd-fix-apc

Bug

Mamba prefix cache reports 0% hit rate on the Decode side in PD disaggregation.

This is PD-specific. In standalone mode, allocate_new_computed_blocks is
skipped entirely (num_external_computed_tokens = 0), and null blocks only
appear later during RUNNING via remove_skipped_blocks, by which time the real
blocks are already hashed.

In PD mode, allocate_new_computed_blocks runs with
num_external_computed_tokens > 0, which pads req_blocks with null blocks
via Mamba's get_num_skipped_tokens(N) = N-1. The old code then set:

self.num_cached_block[request_id] = len(req_blocks)  # counts nulls!

When _update_waiting_for_remote_kv later called cache_blocks(), it found
num_cached_block >= num_full_blocks and early-returned — nothing was ever
hashed into the block pool, so every subsequent find_longest_cache_hit missed.

allocate_new_computed_blocks (400 tokens, block_size=128):
  get_num_skipped_tokens(400) = 399 → num_skipped_blocks = 3
  req_blocks = [null, null, null, fresh]
  num_cached_block = 3                  ← BUG: counts nulls

cache_blocks(400):
  num_full_blocks = 400 // 128 = 3
  3 >= 3 → EARLY RETURN → nothing hashed → 0% hit rate
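
The arithmetic in the trace above can be reproduced with a small standalone sketch. It is not vLLM code: the function names are illustrative, and it assumes num_skipped_blocks is the floor division of the skipped token count by the block size, which matches the 399 → 3 step shown above.

```python
BLOCK_SIZE = 128


def get_num_skipped_tokens(num_computed_tokens: int) -> int:
    # Mamba rule quoted in the description: get_num_skipped_tokens(N) = N - 1.
    return num_computed_tokens - 1


def early_return_check(num_tokens: int, block_size: int = BLOCK_SIZE):
    """Return (num_cached_block, num_full_blocks, early_return) for the old code path."""
    # Assumption: skipped blocks = skipped tokens // block_size.
    num_skipped_blocks = get_num_skipped_tokens(num_tokens) // block_size
    # Old behaviour: the null padding is counted as already cached.
    num_cached_block = num_skipped_blocks
    num_full_blocks = num_tokens // block_size
    # cache_blocks() returns early when nothing new appears to need hashing.
    return num_cached_block, num_full_blocks, num_cached_block >= num_full_blocks


print(early_return_check(400))  # (3, 3, True)
```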

Fix

Two changes, both in single_type_kv_cache_manager.py:

1. Don't count null blocks in num_cached_block

Capture len(new_computed_blocks) before the skip-slicing that strips
leading blocks. This counts only real prefix-hit blocks, not null padding:

  num_computed_blocks = len(new_computed_blocks)   # before slicing
  # ... slicing, padding, etc ...
  self.num_cached_block[request_id] = num_computed_blocks

This is a no-op for FullAttention (no skipping) and SWA (the null padding in
new_computed_blocks from find_longest_cache_hit exactly equals
num_skipped_blocks, so the count is unchanged).
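
A toy model of this change, under stated assumptions: Block, NULL_BLOCK and allocate are hypothetical stand-ins rather than the real vLLM classes, and the padding step is simplified to "prepend num_skipped_blocks null blocks". The sketch only compares the old count (after padding) with the new count (before slicing) for the three cases discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    block_id: int
    is_null: bool = False


NULL_BLOCK = Block(block_id=0, is_null=True)


def allocate(new_computed_blocks, num_skipped_blocks):
    """Return (req_blocks, old_count, new_count) for a fresh request."""
    # New behaviour: capture the count before the skip-slicing.
    new_count = len(new_computed_blocks)
    kept = new_computed_blocks[num_skipped_blocks:]
    # Simplified padding: skipped positions are filled with null blocks.
    req_blocks = [NULL_BLOCK] * num_skipped_blocks + kept
    # Old behaviour: count everything in req_blocks, nulls included.
    old_count = len(req_blocks)
    return req_blocks, old_count, new_count


# Mamba on the Decode side: no local prefix hit, three skipped blocks.
print(allocate([], 3)[1:])                                   # (3, 0)
# SWA: the null padding already present equals num_skipped_blocks.
print(allocate([NULL_BLOCK, NULL_BLOCK, Block(7)], 2)[1:])   # (3, 3)
# FullAttention: nothing is skipped.
print(allocate([Block(1), Block(2)], 0)[1:])                 # (2, 2)
```

Only the Mamba case changes; the other two produce the same count either way, which is the no-op property claimed above.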

2. Register null-block hashes in MambaManager.cache_blocks

With fix 1, cache_blocks() no longer early-returns — it iterates the null
blocks. But BlockPool.cache_full_blocks skips them (blk.is_null → continue),
so their hashes never enter the hash map.

Mamba's find_longest_cache_hit searches right-to-left through block hashes.
If null-block positions aren't in the hash map, the search misses and
hit_length drops to 0, dragging the HMA coordinator's overall hit to 0.

MambaManager.cache_blocks now registers hash → null_block entries for null positions.
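
The interaction between the skipped null blocks and the right-to-left search can be illustrated with a minimal sketch. Everything here is a simplified assumption: the hash map is a plain dict, the hashes are strings, and the three functions are toy versions of the behaviour described above, not the actual BlockPool or MambaManager implementations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    block_id: int
    is_null: bool = False


NULL_BLOCK = Block(block_id=0, is_null=True)


def cache_full_blocks(hash_map, blocks, block_hashes):
    """Toy pool-side caching: null blocks are skipped, as described above."""
    for blk, h in zip(blocks, block_hashes):
        if blk.is_null:
            continue
        hash_map[h] = blk


def register_null_hashes(hash_map, blocks, block_hashes):
    """Toy version of the extra step: hash -> null_block for null positions."""
    for blk, h in zip(blocks, block_hashes):
        if blk.is_null:
            hash_map.setdefault(h, NULL_BLOCK)


def find_longest_cache_hit(hash_map, block_hashes):
    """Toy right-to-left search; returns the hit length in blocks."""
    for i in range(len(block_hashes) - 1, -1, -1):
        if block_hashes[i] in hash_map:
            return i + 1
    return 0


full_blocks = [NULL_BLOCK, NULL_BLOCK, NULL_BLOCK]  # the three full blocks are all null
hashes = ["h0", "h1", "h2"]

hash_map = {}
cache_full_blocks(hash_map, full_blocks, hashes)
print(find_longest_cache_hit(hash_map, hashes))  # 0: nothing was registered

register_null_hashes(hash_map, full_blocks, hashes)
print(find_longest_cache_hit(hash_map, hashes))  # 3: null positions now hit
```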

Reproducer (PD disaggregation)

# D
VLLM_NIXL_SIDE_CHANNEL_PORT=$(just port 5558) VLLM_SSM_CONV_STATE_LAYOUT=DS vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 --port $(just port 8200) --enforce-eager --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --trust-remote-code --max-model-len 131072 --block-size 128 --enable-prefix-caching --no-disable-hybrid-kv-cache-manager --mamba-cache-mode align --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# P
VLLM_NIXL_SIDE_CHANNEL_PORT=$(just port 5557) vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 --port $(just port 8100)  --gpu-memory-utilization 0.9 --trust-remote-code --enforce-eager --max-model-len 131072 --block-size 128 --enable-prefix-caching --no-disable-hybrid-kv-cache-manager --mamba-cache-mode align --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# proxy
python vllm/tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --port $(just port 8192) --prefiller-port $(just port 8100) --decoder-port $(just port 8200)

# Send same request twice and observe D-side logs:

# D
(APIServer pid=2777847) INFO 05-13 18:03:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, **Prefix cache hit rate: 0.0%,** External prefix cache hit rate: 100.0%

Test with

  pytest tests/v1/core/test_single_type_kv_cache_manager.py -k "mamba_align" -v
  pytest tests/v1/kv_connector/unit/test_nixl_connector_hma.py -k "ssm_prefix" -v

Benchmark

A simple scenario, PD TP1, H100, Nemotron3-Nano, ~8k/1k:

vllm bench serve --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --dataset-name prefix_repetition --num-prompts 1000 \
  --base-url http://localhost:55483 --ignore-eos --max-concurrency 100 \
  --prefix-repetition-prefix-len 6000 --prefix-repetition-suffix-len 2000 \
  --prefix-repetition-num-prefixes 100 --prefix-repetition-output-len 1000

# --no-enable-prefix-caching
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             100
Benchmark duration (s):                  539.67
Total input tokens:                      8000031
Total generated tokens:                  1000000
Request throughput (req/s):              1.85
Output token throughput (tok/s):         1852.97
Peak output token throughput (tok/s):    2100.00
Peak concurrent requests:                106.00
Total token throughput (tok/s):          16676.77
---------------Time to First Token----------------
Mean TTFT (ms):                          1632.73
Median TTFT (ms):                        579.53
P99 TTFT (ms):                           18875.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.70
Median TPOT (ms):                        52.04
P99 TPOT (ms):                           52.53
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.70
Median ITL (ms):                         51.92
P99 ITL (ms):                            60.07
==================================================

# --enable-prefix-caching --mamba-cache-mode align
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             100
Benchmark duration (s):                  533.51
Total input tokens:                      8000031
Total generated tokens:                  1000000
Request throughput (req/s):              1.87
Output token throughput (tok/s):         1874.39
Peak output token throughput (tok/s):    2100.00
Peak concurrent requests:                112.00
Total token throughput (tok/s):          16869.59
---------------Time to First Token----------------
Mean TTFT (ms):                          1458.54
Median TTFT (ms):                        508.13
P99 TTFT (ms):                           16325.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.35
Median TPOT (ms):                        51.32
P99 TPOT (ms):                           53.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.35
Median ITL (ms):                         51.49
P99 ITL (ms):                            61.80
==================================================
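
To make the two runs easier to compare, the relative differences can be computed directly from the figures reported above. This is only a convenience sketch; the dictionary keys are made-up labels, and the values are copied from the two result tables.

```python
# Values copied from the two benchmark tables above.
no_prefix_caching = {
    "output_tok_s": 1852.97,
    "mean_ttft_ms": 1632.73,
    "median_ttft_ms": 579.53,
    "p99_ttft_ms": 18875.09,
    "mean_tpot_ms": 51.70,
}
prefix_caching_align = {
    "output_tok_s": 1874.39,
    "mean_ttft_ms": 1458.54,
    "median_ttft_ms": 508.13,
    "p99_ttft_ms": 16325.75,
    "mean_tpot_ms": 51.35,
}


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the value went down."""
    return (after - before) / before * 100.0


deltas = {
    name: round(pct_change(no_prefix_caching[name], prefix_caching_align[name]), 1)
    for name in no_prefix_caching
}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

On these numbers, mean TTFT drops by roughly 11% and P99 TTFT by roughly 13% with prefix caching enabled, while output throughput and TPOT move by about 1%.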

mergify · 2026-05-13T17:12:09Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

This pull request implements support for KV transfer in Mamba hybrid models, specifically addressing challenges with heterogeneous Tensor Parallelism (TP) and prefix caching. Significant changes include updating BlockPool to register null block hashes, refining the NixlConnectorWorker to use remote physical block ratios for kernel block mapping, and introducing _apply_prefix_caching to manage block ID trimming. The PR also adds validation to disable prefix caching for Mamba hybrid models when physical block counts are heterogeneous. Review feedback highlights a design constraint in the SSM block handling where an assertion assumes a single local block, suggesting this should be better documented or handled with a descriptive error.

gemini-code-assist · 2026-05-13T17:14:58Z

+                if (
+                    _is_ssm_spec(self._group_spec_types[i])
+                    and num_local_blocks < num_remote_blocks
+                ):
+                    # NOTE (NickLucche): With prefix caching on SSM, (remote) blocks
+                    # prior to the last one are placeholders (null blocks). Mind that
+                    # this doesn't really impact transfer, as we only still care about
+                    # the last "block", the full in-place state.
+                    assert num_local_blocks == 1, "SSM can only have one local block"
+                    remote_block_ids[i] = remote_group[-num_local_blocks:]


The assertion assert num_local_blocks == 1 assumes that SSM groups can only have one local block. If this is a design constraint, it should be documented as such in the class or method docstring, or the assertion should be replaced with a more descriptive error message if it's a potential runtime failure point.
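
One way the suggestion could look, as a sketch only: select_remote_ssm_blocks is a hypothetical standalone helper, not a function in NixlConnectorWorker, and it assumes block IDs are plain integers. It keeps the trailing-block slicing from the diff but replaces the bare assert with an explicit error.

```python
def select_remote_ssm_blocks(remote_group: list[int], num_local_blocks: int) -> list[int]:
    """Keep only the trailing remote block(s) that carry the real SSM state.

    With prefix caching on SSM, remote blocks before the last one are
    placeholders (null blocks), so only the tail is kept.
    """
    num_remote_blocks = len(remote_group)
    if num_local_blocks >= num_remote_blocks:
        # Nothing to trim.
        return remote_group
    if num_local_blocks != 1:
        # Descriptive error instead of a bare assert.
        raise ValueError(
            "SSM groups are expected to hold exactly one local block, "
            f"got {num_local_blocks} local vs {num_remote_blocks} remote blocks"
        )
    return remote_group[-num_local_blocks:]


print(select_remote_ssm_blocks([0, 0, 0, 42], 1))  # [42]
```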

heheda12345 · 2026-05-15T04:56:46Z

+        # Only count non-null blocks as cached. Null blocks appear here from Mamba
+        # align-mode and SWA/chunked-local attention.
+        num_cached = sum(1 for b in req_blocks if not b.is_null)
+        self.num_cached_block[request_id] = num_cached


Agreed that this line has a bug when delay_cache_block is True, but I think it should be set to the input len(new_computed_blocks), captured before the line new_computed_blocks = new_computed_blocks[num_skipped_blocks:]

Thanks @heheda12345!
This isn't quite working even after reverting block_pool. Will investigate further ASAP.

mergify · 2026-05-18T10:17:00Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

NickLucche · 2026-05-20T13:48:40Z

 from vllm.v1.core.block_pool import BlockPool
 from vllm.v1.core.kv_cache_utils import (
    BlockHashList,
+    BlockHashListWithBlockSize,
    BlockHashWithGroupId,


Ignore everything that isn't in this file.

NickLucche · 2026-05-20T14:19:22Z

@heheda12345 @tdoublep I've confined the changes to MambaManager.cache_blocks, but I still have to account for those null blocks to get a hit on all groups (we mostly care about FA on the transfer side).
Do you see a cleaner way to fix this?

underfituu · 2026-05-21T12:46:27Z

Hi @NickLucche @heheda12345 @tdoublep,

Regarding the cache hit 0% issue discussed here, I've proposed a solution in my PR #42524 that might help address this.

Could you please take a look and see if it aligns with what you're trying to achieve for the mamba hybrid models? Would love to get your feedback!

mergify Bot added v1 kv-connector labels May 13, 2026

mergify Bot added the needs-rebase label May 13, 2026

gemini-code-assist Bot reviewed May 13, 2026

NickLucche added 2 commits May 13, 2026 17:15

prefix caching for matching block_size

a905fd6

Signed-off-by: NickLucche <nlucches@redhat.com>

partial hit for FA

7430a47

Signed-off-by: NickLucche <nlucches@redhat.com>

NickLucche force-pushed the pd-fix-apc branch from 35f0c58 to 90e09f2 Compare May 13, 2026 17:26

mergify Bot removed the needs-rebase label May 13, 2026

test

d87d5f1

Signed-off-by: NickLucche <nlucches@redhat.com>

NickLucche mentioned this pull request May 13, 2026

[PD][Nixl] Mamba prefix caching mode support #42554

Merged

avifenesh mentioned this pull request May 14, 2026

Allow LMCacheConnectorV1 to support hybrid KV loads #42620

Open

heheda12345 reviewed May 15, 2026

Comment thread vllm/v1/core/block_pool.py Outdated

NickLucche changed the title ~~[PD] Fix Mamba cache align mode with PD~~ [PD][Core] Fix Mamba prefix cache with PD May 18, 2026

mergify Bot added the needs-rebase label May 18, 2026

NickLucche added 3 commits May 20, 2026 12:55

init

b65e35f

Signed-off-by: NickLucche <nlucches@redhat.com>

comments

90fabda

Signed-off-by: NickLucche <nlucches@redhat.com>

tests

477556a

Signed-off-by: NickLucche <nlucches@redhat.com>

NickLucche force-pushed the pd-fix-apc branch from 90e09f2 to 477556a Compare May 20, 2026 13:47

NickLucche commented May 20, 2026

mergify Bot removed the needs-rebase label May 20, 2026

hoobnn mentioned this pull request May 26, 2026

[V1][Mamba] Opt-in granular prefill to fix align-mode prefix-cache misses on incremental requests (#43587) #43628

Closed

NickLucche mentioned this pull request Jun 1, 2026

[PD][Feature] Add KV consumer partial-group caching for hybrid Mamba models #42524

Open

lHrHenry233 mentioned this pull request Jun 4, 2026

[PD][Feature] Add KV consumer partial-group caching for hybrid Mamba models vllm-project/vllm-ascend#10009

Open

underfituu mentioned this pull request Jun 4, 2026

[Community] Weekly Meeting Agenda vllm-project/vllm-ascend#3642

Open

vllm-bot closed this in #42554 Jun 4, 2026

NickLucche wants to merge 6 commits into vllm-project:main from NickLucche:pd-fix-apc

3 participants
