[HMA] [KVEvent] Enable GPU-side KV events for HMA #37688
orozery merged 16 commits into vllm-project:main
Conversation
Code Review
This pull request introduces an evicted_groups field to the BlockRemoved KV cache event to support Hybrid Model Architecture (HMA) aware prefix-cache routing. The implementation correctly populates this field for GPU-side block evictions. However, for blocks evicted via offloading managers (ARCOffloadingManager and LRUOffloadingManager), the group information is not populated, which significantly limits the feature's effectiveness in scenarios involving KV cache offloading. I've added high-severity comments to highlight this functional gap.
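For context, a minimal sketch of what the extended event could look like. Only `BlockRemoved` and `evicted_groups` come from this PR; the other field names and types here are illustrative, not vLLM's exact definitions:

```python
from dataclasses import dataclass, field

@dataclass
class BlockRemoved:
    # Hashes of the evicted KV cache blocks.
    block_hashes: list[int] = field(default_factory=list)
    # KV cache group indices the evicted blocks belong to. None would mean
    # the eviction is not attributed to specific groups -- the gap this
    # review flags for the offloading managers.
    evicted_groups: list[int] | None = None
```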
This pull request has merge conflicts that must be resolved before it can be merged.
Moving to draft for now as I need to update this PR further.
Force-pushed 9bc6b63 to 63a9bf9
Force-pushed 0001e88 to 4c231e7
Force-pushed 4c231e7 to cb22782
Documentation preview: https://vllm--37688.org.readthedocs.build/en/37688/
Evicted group information is important for HMA-aware prefix-cache routing in distributed serving frameworks, as vLLM can evict SWA blocks while retaining the full-attention blocks. Without this information, a router would assume complete eviction and miss valid routing. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comment: - vllm-project#37688 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Recommendation from review, as more work is needed to enable HMA on the CPU side; this will be done in vllm-project#38453. Review comments: - vllm-project#37688 (comment) - vllm-project#37688 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comment: - vllm-project#37688 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Using a value of `self.group_idx if self.group_idx else None` would return the same hash for group_idx == 0 and group_idx == None, because both are falsy. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
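A standalone illustration of the pitfall this commit fixes (function names are illustrative, not from the PR):

```python
# 0 and None are both falsy, so a truthiness check collapses group 0
# into "no group" before hashing.
def bad_key(group_idx):
    return hash((group_idx if group_idx else None,))

def good_key(group_idx):
    # Hash the value as-is; group 0 stays distinct from None.
    return hash((group_idx,))

assert bad_key(0) == bad_key(None)    # collision: group 0 treated as no group
assert good_key(0) != good_key(None)  # distinct hashes
```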
Co-authored-by: Or Ozeri <or@ozery.com> Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Force-pushed b2dc540 to 8a7c373
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Force-pushed 8a7c373 to 110733b
This doesn't look like an error caused by the PR in https://buildkite.com/vllm/ci/builds/59637/steps/canvas?sid=019d5333-4316-40e3-85e8-ff134f99245d, but more like an issue on the system: [...]
[2026-04-03T12:13:55Z] [rank1]: File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 644, in _check_enough_kv_cache_memory
[2026-04-03T12:13:55Z] [rank1]: raise ValueError(
[2026-04-03T12:13:55Z] [rank1]: ValueError: To serve at least one request with the models's max seq len (4096), (0.5 GiB KV cache is needed, which is larger than the available KV cache memory (0.48 GiB). Based on the available memory, the estimated maximum model length is 3888. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com> Co-authored-by: Or Ozeri <or@ozery.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com> Co-authored-by: Or Ozeri <or@ozery.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Purpose
Evicted group information is important for Hybrid Model Architecture (HMA) aware prefix-cache routing in distributed serving frameworks, as vLLM can evict Sliding Window Attention (SWA) blocks while retaining the full-attention blocks. Without this information, a router would assume complete eviction and miss valid routing paths.
This PR enables KV events for HMA on the GPU side.
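As a sketch of the consumer side: `evicted_groups` and `block_hashes` follow this PR, but the routing index structure and the `FULL_ATTN_GROUP` constant below are assumptions for illustration, not vLLM or router code.

```python
FULL_ATTN_GROUP = 0  # assumed group index for the full-attention layers

def on_block_removed(index: dict, event) -> None:
    if event.evicted_groups is None or FULL_ATTN_GROUP in event.evicted_groups:
        # Full eviction (or no group info): stop routing to this prefix.
        for block_hash in event.block_hashes:
            index.pop(block_hash, None)
    # Otherwise only SWA groups were evicted; the full-attention prefix is
    # still cached on the worker, so the routing entries stay valid.
```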
Test Plan
Run benchmarking with KV events enabled:
```
VLLM_LOG_STATS_INTERVAL=0.01 vllm bench throughput --model openai/gpt-oss-20b --num-prompts 1000 --kv-events-config '{"enable_kv_cache_events": "True", "publisher": "zmq", "topic": "kv-events"}'
```
or
```
VLLM_LOG_STATS_INTERVAL=0.01 vllm bench throughput --model Qwen/Qwen3-14B --kv-offloading-size 10 --disable-hybrid-kv-cache-manager --num-prompts 1000 --kv-events-config '{"enable_kv_cache_events": "True", "publisher": "zmq", "topic": "kv-events"}'
```
Use the updated example client to receive the events:
https://github.com/hickeyma/vllm/blob/9ff97922c104117e1577cf9554902a73d2658391/examples/online_serving/kv_events_subscriber.py
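If you only want to eyeball the raw events, a minimal ZMQ subscriber along these lines should work; the endpoint and frame layout are assumptions here, and the linked example client is the authoritative version:

```python
import zmq

# Assumed defaults: the zmq publisher binds tcp://*:5557 and prefixes each
# message with the configured topic ("kv-events" above).
ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://localhost:5557")
sock.setsockopt_string(zmq.SUBSCRIBE, "kv-events")

while True:
    frames = sock.recv_multipart()
    # Expected frames: topic, sequence number, encoded event batch.
    print(frames)
```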
Test Result
Stored and removed events are received in the client with the new group fields.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.