v1/engine: emit prefix-cache KV-events at hash_block_size granularity for hybrid Mamba+Attention models #43258
Code Review
This pull request introduces sub-block BlockStored event emission for hybrid models, ensuring that KV-cache events are fired at the hash_block_size granularity even when physical block sizes are inflated. It adds a tracking cursor for emitted blocks and updates engine metadata to reflect the effective block size. Review feedback identifies a bug where incorrect attributes are accessed on LoRARequest objects and points out a potential issue with duplicate event emissions for hybrid models.
lora_id=(request.lora_request.adapter_id
         if request.lora_request else None),
medium=MEDIUM_GPU,
lora_name=(request.lora_request.name
           if request.lora_request else None),
The fields adapter_id and name do not exist on the LoRARequest class used in vLLM. Accessing them will raise an AttributeError when processing requests with LoRA enabled. Based on the LoRARequest definition in vllm/lora/request.py and its usage in other parts of the codebase (e.g., vllm/v1/core/kv_cache_utils.py), these should be lora_int_id and lora_name respectively.
- lora_id=(request.lora_request.adapter_id
-          if request.lora_request else None),
- medium=MEDIUM_GPU,
- lora_name=(request.lora_request.name
-            if request.lora_request else None),
+ lora_id=(request.lora_request.lora_int_id
+          if request.lora_request else None),
+ medium=MEDIUM_GPU,
+ lora_name=(request.lora_request.lora_name
+            if request.lora_request else None),
Thanks for the review. adapter_id and name are @property aliases defined on LoRARequest at vllm/lora/request.py:41-46:
    @property
    def adapter_id(self):
        return self.lora_int_id

    @property
    def name(self):
        return self.lora_name
The existing emit path at vllm/v1/core/block_pool.py:310,314 uses these same property accessors, so this change follows the convention already in the file. Happy to switch to lora_int_id/lora_name if maintainers prefer the underlying field names everywhere; let me know.
self.num_cached_block[request.request_id] = num_full_blocks
# Also emit sub-block events so prompts smaller than the inflated
# physical block still produce a KV-event signal.
self._maybe_emit_sub_block_events(request, num_tokens)
This implementation introduces duplicate BlockStored events for hybrid models. When block_size > hash_block_size, the self.block_pool.cache_full_blocks call (lines 352-359) already emits BlockStored events for the newly cached physical blocks at hash_block_size granularity. By calling _maybe_emit_sub_block_events immediately after, the same hash blocks are emitted again (once as part of the physical block event and once as individual sub-block events).
This redundancy wastes bandwidth and can cause issues for downstream consumers that expect unique events per hash block. To fix this, you should ensure that _maybe_emit_sub_block_events only emits for tokens that are not yet covered by a full physical block, or coordinate with BlockPool to suppress its internal emission when sub-block tracking is active for a group.
Thanks for flagging this. The two emission paths actually fire at different granularities for the hybrid case where block_size > hash_block_size, so they end up complementary rather than duplicating each other:
* block_pool.cache_full_blocks (vllm/v1/core/block_pool.py:241-256) rebuilds block_hashes with BlockHashListWithBlockSize(request.block_hashes, self.hash_block_size, block_size) when block_size != hash_block_size, and emits BlockStored with block_size=self.block_size (the inflated value, e.g. 2176). One coarse event per physical block, hashed over the full physical-block window.
* _maybe_emit_sub_block_events (new) walks request.block_hashes directly and emits BlockStored with block_size=hash_block_size (e.g. 64): many fine events per physical block, each hashed over a 64-token window.
Because the hash inputs are different (a hash over 2176 tokens vs. a hash over 64 tokens), the resulting block_hashes values are distinct between the two streams; downstream consumers see them as different cached entries at different granularities, not redundant events for the same hash.
Verified empirically against vllm/vllm-openai:nightly-bf610c2f5 with a hybrid Mamba+Attention model, --block-size 64, --enable-prefix-caching, and the ZMQ subscriber from the PR description: without this change the subscriber receives only block_size=2176 events; with this change those exact same coarse events still arrive and a new stream of block_size=64 events appears alongside them (with the right parent_block_hash chain).
Consumers that only need the fine granularity can filter on block_size == hash_block_size. If maintainers think the coarse stream should be suppressed for hybrid groups when sub-block emission is active, I can gate it behind a flag — happy to discuss the right behaviour here.
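To make the argument above concrete, here is a minimal sketch of two hash streams computed over the same tokens at different window sizes, followed by the consumer-side filter on block_size. The helper chain_hashes, the SHA-256 chaining scheme, and the dict-shaped events are assumptions made for this illustration only; they are not vLLM's actual hash function or event types.

```python
import hashlib

def chain_hashes(token_ids, block_size):
    """Hash each full block_size-token window, chaining on the previous hash.

    Illustrative only: a stand-in for prefix-cache block hashing.
    """
    hashes = []
    parent = b""
    for i in range(len(token_ids) // block_size):
        window = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + b"|" + ",".join(map(str, window)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

HASH_BLOCK_SIZE = 64         # user-supplied --block-size
PHYSICAL_BLOCK_SIZE = 2176   # inflated attention block size (example value from this thread)

tokens = list(range(PHYSICAL_BLOCK_SIZE))  # exactly one full physical block
fine = chain_hashes(tokens, HASH_BLOCK_SIZE)        # 2176 // 64 = 34 hashes
coarse = chain_hashes(tokens, PHYSICAL_BLOCK_SIZE)  # 1 hash

# Hypothetical event dicts: one coarse event plus one fine event per hash block.
events = (
    [{"block_size": PHYSICAL_BLOCK_SIZE, "block_hashes": coarse}]
    + [{"block_size": HASH_BLOCK_SIZE, "block_hashes": [h]} for h in fine]
)

# A consumer that only wants the fine granularity filters on block_size.
fine_only = [e for e in events if e["block_size"] == HASH_BLOCK_SIZE]
```

Because the hashed windows differ in length, no fine hash coincides with the coarse one in this sketch, which is the property the reply relies on.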
This pull request has merge conflicts that must be resolved before it can be
ae90ff8 to
74f7233
…r hybrid models
For hybrid Mamba+Attention models vLLM inflates cache_config.block_size to
attn_block_size so the attention page size is >= the mamba page size.
Without this patch the inflation also wipes out the finer hash_block_size
the user supplied via --block-size, so:
* resolve_kv_cache_block_sizes() returns hash_block_size == block_size,
* SingleTypeKVCacheManager.cache_blocks() only fires BlockStored events
on full inflated physical blocks,
* prompts shorter than the inflated block produce zero kv-events.
This patch:
* Preserves the user --block-size as hash_block_size on both the
Platform.check_and_update_config and EngineCore resolution paths
before the inflation runs.
* Emits synthetic BlockStored events at hash-block granularity from
SingleTypeKVCacheManager.cache_blocks() whenever block_size >
hash_block_size (only attention groups; mamba groups are explicitly
suppressed by overriding the emit hook to a noop).
Signed-off-by: Vanshil Shah <vanshils@nvidia.com>
Report hash_block_size (not spec.block_size) when sub-block emission is active so downstream KV-event consumers use the right hashing granularity. Falls back to spec.block_size when hash_block_size is unset or equals the physical block size.
Pairs with the sub-block emit patch in this branch: the emit patch sends events at hash_block_size granularity, this patch makes sure consumers know that granularity.
Signed-off-by: Vanshil Shah <vanshils@nvidia.com>
74f7233 to
e9c0793
Summary
Two changes that make vLLM emit prefix-cache KV-events at the user-specified hash_block_size granularity for hybrid Mamba+Attention models, where the physical attention block size gets inflated to satisfy the mamba page size constraint.
Motivation
For hybrid Mamba+Attention models, vLLM inflates cache_config.block_size to attn_block_size so the attention page size is ≥ the mamba page size. On current main, that inflation also overrides the finer hash_block_size the user requested via --block-size, so:
* resolve_kv_cache_block_sizes() returns hash_block_size == block_size,
* SingleTypeKVCacheManager.cache_blocks() only fires BlockStored events on full inflated physical blocks,
* prompts shorter than the inflated block produce zero kv-events, which downstream consumers would otherwise use for routing or observability.
What this PR changes
* Preserve hash_block_size before inflation. In both Platform.check_and_update_config (vllm/platforms/interface.py) and the EngineCore resolution path (vllm/v1/engine/core.py), capture the user-supplied block_size as hash_block_size before inflating block_size to attn_block_size.
* Emit sub-block events at hash_block_size granularity. SingleTypeKVCacheManager._maybe_emit_sub_block_events advances a per-request cursor over hash-block boundaries and appends BlockStored events to block_pool.kv_event_queue whenever block_size > hash_block_size. Mamba groups override this with a noop, so only attention groups emit (matches the existing convention that only attention groups participate in prefix-cache hashing).
* Surface hash_block_size in metadata. EngineCoreProc.get_kv_cache_group_metadata now reports hash_block_size (not spec.block_size) when sub-block emission is active, so downstream kv-event consumers use the right hashing granularity. Falls back to spec.block_size when hash_block_size is unset or equals the physical block size.
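The cursor and the metadata fallback described in the list above can be sketched roughly as follows. This is a simplified standalone model, not the code in this PR: the class SubBlockEmitter, the function reported_block_size, the dict-shaped events, and the choice to batch all newly completed hash blocks into one event are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SubBlockEmitter:
    """Toy model of a per-request cursor over hash-block boundaries."""
    hash_block_size: int
    block_size: int
    emitted: dict = field(default_factory=dict)  # request_id -> hash blocks already emitted
    queue: list = field(default_factory=list)    # stand-in for the kv-event queue

    def maybe_emit(self, request_id, block_hashes, num_tokens):
        # Only relevant when the physical block is larger than the hash block.
        if self.block_size <= self.hash_block_size:
            return
        start = self.emitted.get(request_id, 0)
        end = min(num_tokens // self.hash_block_size, len(block_hashes))
        if end <= start:
            return  # no new full hash block since the last call
        self.queue.append({
            "block_hashes": block_hashes[start:end],
            "parent_block_hash": block_hashes[start - 1] if start > 0 else None,
            "block_size": self.hash_block_size,
        })
        self.emitted[request_id] = end  # advance the cursor

def reported_block_size(spec_block_size, hash_block_size):
    """Block size to advertise in metadata, with the fallback described above."""
    if hash_block_size is None or hash_block_size == spec_block_size:
        return spec_block_size
    return hash_block_size

emitter = SubBlockEmitter(hash_block_size=64, block_size=2176)
hashes = [f"h{i}" for i in range(20)]                  # placeholder hash values
emitter.maybe_emit("req-0", hashes, num_tokens=1000)   # 1000 // 64 = 15 hash blocks
emitter.maybe_emit("req-0", hashes, num_tokens=1000)   # cursor unchanged, nothing emitted
emitter.maybe_emit("req-0", hashes, num_tokens=1300)   # 1300 // 64 = 20, emits blocks 15..19
```

Keeping the cursor per request means repeated calls for the same token count emit nothing, so each hash block is reported once on this path.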
This is purely additive: the legacy event path (full physical blocks) fires exactly as before. Behaviour is unchanged for any deployment that does not configure --kv-events-config and does not pass --block-size smaller than the model's effective attn_block_size.
Minimal reproducer
A small Python ZMQ subscriber is enough to see the new events. Start a
hybrid Mamba+Attention model with prefix caching and a kv-events publisher:
Run a subscriber alongside it:
Send a short request whose prompt is smaller than the inflated attention block (e.g. ~1000 tokens when attn_block_size inflates to 2176):
On main today: the worker logs "Setting attention block size to <N> tokens to ensure that attention page size is >= mamba page size." and the subscriber receives no BlockStored events for that short-prompt request; the prefix signal is silently dropped.
With this PR: the same worker logs "Setting attention block size to <N> tokens to ensure that attention page size is >= mamba page size (hash granularity preserved at <H>)." and the subscriber receives one or more BlockStored events whose block_size field equals the user-supplied hash_block_size rather than the inflated attn_block_size. Each event carries one block_hash per hash_block_size-token chunk of the prompt, with parent_block_hash chaining the chunks together exactly the way the regular full-block path already chains its events.
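As a rough consumer-side check of the behaviour described above, the sketch below works out how many hashes to expect for the example prompt and verifies that a sequence of received events forms one unbroken parent chain. The function names and the dict-shaped events are hypothetical; the sketch also assumes that only full hash_block_size-token chunks are hashed, so trailing tokens past the last chunk boundary produce no hash.

```python
def expected_counts(prompt_len, hash_block_size, physical_block_size):
    """Number of full chunks at each granularity for a prompt of prompt_len tokens."""
    return {
        "fine_hashes": prompt_len // hash_block_size,
        "coarse_blocks": prompt_len // physical_block_size,
    }

def chain_is_consistent(events):
    """True if every event's parent_block_hash equals the last hash seen before it."""
    last = None
    for event in events:
        if not event["block_hashes"] or event["parent_block_hash"] != last:
            return False
        last = event["block_hashes"][-1]
    return True

# Example values from this description: ~1000-token prompt, 64 vs. 2176.
counts = expected_counts(1000, 64, 2176)

# Placeholder events for one request, in arrival order.
received = [
    {"block_size": 64, "parent_block_hash": None, "block_hashes": ["a", "b", "c"]},
    {"block_size": 64, "parent_block_hash": "c", "block_hashes": ["d", "e"]},
]
broken = [
    {"block_size": 64, "parent_block_hash": None, "block_hashes": ["a", "b"]},
    {"block_size": 64, "parent_block_hash": "a", "block_hashes": ["c"]},  # should be "b"
]
ok = chain_is_consistent(received)
bad = chain_is_consistent(broken)
```

For the 1000-token example this gives 15 fine hashes (covering 960 tokens) and 0 coarse blocks, which matches the observation that main emits nothing for such a prompt.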
DCO
Both commits are signed-off-by Vanshil Shah
<vanshils@nvidia.com>.