-
-
Notifications
You must be signed in to change notification settings - Fork 17.7k
v1/engine: emit prefix-cache KV-events at hash_block_size granularity for hybrid Mamba+Attention models #43258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
vanshilshah97
wants to merge
2
commits into
vllm-project:main
Choose a base branch
from
vanshilshah97:vanshils/kv-events-hybrid-hash-block-size
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+117
−21
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation introduces duplicate
BlockStoredevents for hybrid models. Whenblock_size > hash_block_size, theself.block_pool.cache_full_blockscall (lines 352-359) already emitsBlockStoredevents for the newly cached physical blocks athash_block_sizegranularity. By calling_maybe_emit_sub_block_eventsimmediately after, the same hash blocks are emitted again (once as part of the physical block event and once as individual sub-block events).This redundancy wastes bandwidth and can cause issues for downstream consumers that expect unique events per hash block. To fix this, you should ensure that
_maybe_emit_sub_block_eventsonly emits for tokens that are not yet covered by a full physical block, or coordinate withBlockPoolto suppress its internal emission when sub-block tracking is active for a group.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for flagging this. The two emission paths actually fire at different granularities for the hybrid case where
block_size > hash_block_size, so they end up complementary rather than duplicating each other:block_pool.cache_full_blocks(vllm/v1/core/block_pool.py:241-256) rebuildsblock_hasheswithBlockHashListWithBlockSize(request.block_hashes, self.hash_block_size, block_size)whenblock_size != hash_block_size, and emitsBlockStoredwithblock_size=self.block_size(the inflated value, e.g. 2176). One coarse event per physical block, hashed over the full physical-block window._maybe_emit_sub_block_events(new) walksrequest.block_hashesdirectly and emitsBlockStoredwithblock_size=hash_block_size(e.g. 64) — many fine events per physical block, each hashed over a 64-token window.Because the hash inputs are different (a hash over 2176 tokens vs. a hash over 64 tokens), the resulting
block_hashesvalues are distinct between the two streams; downstream consumers see them as different cached entries at different granularities, not redundant events for the same hash.Verified empirically against
vllm/vllm-openai:nightly-bf610c2f5with a hybrid Mamba+Attention model,--block-size 64,--enable-prefix-caching, and the ZMQ subscriber from the PR description: without this change the subscriber receives onlyblock_size=2176events; with this change those exact same coarse events still arrive and a new stream ofblock_size=64events appears alongside them (with the rightparent_block_hashchain).Consumers that only need the fine granularity can filter on
block_size == hash_block_size. If maintainers think the coarse stream should be suppressed for hybrid groups when sub-block emission is active, I can gate it behind a flag — happy to discuss the right behaviour here.