[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector #30166
Conversation
Code Review
This pull request introduces support for using the hybrid KV cache allocator with a KV cache connector, which is a significant enhancement for models with sliding window attention. The goal is to reduce memory pressure and prevent data contention by allocating KV cache blocks more precisely. The changes are extensive, modifying the core allocation logic in SingleTypeKVCacheManager and propagating these changes up to the KVCacheCoordinator and Scheduler. While the overall approach is sound, the implementation contains several temporary workarounds and comments marked as "REMOVE BEFORE MERGE", which are critical to address. I've identified issues in the KV connector factory, the LMCache connector implementation, and potential bugs or data correctness concerns in single_type_kv_cache_manager.py and block_pool.py. These must be resolved to ensure the stability and correctness of the new functionality.
```python
# REMOVE BEFORE MERGE (YIFAN): Revert this warning back to raising
# a ValueError.
logger.warning(
    "Connector %s does not support HMA but HMA is enabled. Please set "
    "--disable-hybrid-kv-cache-manager to disable HMA.",
    connector_cls.__name__,
)
```
This change from raising a ValueError to a logger.warning is marked with a "REMOVE BEFORE MERGE" comment. Using a connector that does not support Hybrid Memory Allocation (HMA) when HMA is enabled can lead to incorrect behavior or hard-to-debug runtime errors. It is much safer to fail fast with an exception. This change should be reverted to raise ValueError before merging to prevent potential issues in production.
```python
raise ValueError(
    f"Connector {connector_cls.__name__} does not support HMA but "
    f"HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`."
)
```

```python
# REMOVE BEFORE MERGE (YIFAN): this is temporary workaround to work with
# LMCache. Remove this once having LMCache-side support for new interfaces.
vllm_config.kv_cache_config = kv_cache_config
```
This block contains a "REMOVE BEFORE MERGE" comment, indicating a temporary workaround. Directly modifying vllm_config by assigning to kv_cache_config is a side effect that can lead to unexpected behavior elsewhere in the system. This workaround should be removed, and a proper solution that avoids mutating the config object should be implemented as noted in the comment.
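One direction that avoids the side effect, sketched only at a high level, is to hand `kv_cache_config` to the component that needs it instead of stashing it on the shared `vllm_config`. The class and parameter names below are illustrative, not the actual vLLM/LMCache interfaces:

```python
# Sketch of the idea only -- names are illustrative, not the real interfaces.

class LMCacheConnectorAdapter:
    def __init__(self, vllm_config, kv_cache_config):
        # Keep a private reference instead of mutating the shared config:
        #     vllm_config.kv_cache_config = kv_cache_config  # side effect
        self._vllm_config = vllm_config
        self._kv_cache_config = kv_cache_config

    @property
    def kv_cache_config(self):
        return self._kv_cache_config
```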
```python
# REMOVE BEFORE MERGE (YIFAN): this is temporary workaround to work with
# LMCache. Remove this once having LMCache-side support for new interfaces.
def request_finished_all_groups(
    self,
    request: "Request",
    block_ids: tuple[list[int], ...],
) -> tuple[bool, dict[str, Any] | None]:
    # NOTE: LMCache overloads request_finished so `block_ids` here can be
    # either list[int] or tuple[list[int], ...]. This could be changed in
    # the future to separate these two methods.
    return self._lmcache_engine.request_finished(request, block_ids)
```
The request_finished_all_groups method is marked as a temporary workaround with a "REMOVE BEFORE MERGE" comment. It appears to be a shim for a new interface required by the hybrid allocator. This temporary implementation should be replaced with a proper solution, and the dependency on this fix in LMCache should be resolved before this pull request is merged.
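For readers unfamiliar with the overloading the NOTE describes, a minimal sketch of how the two call shapes could eventually be separated on the LMCache side might look like the following. This is illustrative only; `_finish_request_for_groups` is a hypothetical helper, not an existing LMCache method:

```python
from typing import Any


def request_finished(
    self,
    request: "Request",
    block_ids: "list[int] | tuple[list[int], ...]",
) -> tuple[bool, dict[str, Any] | None]:
    # Illustrative dispatch only. With the hybrid allocator, block IDs arrive
    # as one list per KV cache group; the legacy path passes a single flat list.
    if isinstance(block_ids, tuple):
        per_group_ids = block_ids
    else:
        per_group_ids = (block_ids,)
    # Hypothetical helper that always operates on per-group block IDs.
    return self._finish_request_for_groups(request, per_group_ids)
```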
Hi @ivanium, the pre-commit checks have failed. Please run:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, pre-commit will run automatically.
Good work! In terms of landing this PR, @heheda12345 previously suggested that I split my work into smaller PRs, and I would prefer the same for this PR. Example:
Purpose
This is a continuation of PR #23624 to support the hybrid KV cache manager together with a KV cache connector.
Design doc with details drafted by @KuntaiDu: link
In short, the current hybrid KV cache manager tries to allocate all tokens for sliding-window layers, the same as for full-attention layers, and then in the next scheduling step frees the unneeded tokens (those outside the sliding window) and turns them into prefix cache in GPU memory. This PR instead aims to allocate KV cache only for the tokens inside the sliding window for sliding-window layers (a rough sketch of the difference is shown after this list). This addresses two issues:

1. Allocating all tokens for sliding-window layers puts unnecessary memory pressure on the KV cache pool.
2. The `allocate-then-free` pattern can cause data contention, where the connector might still be copying some KV cache blocks for one request in the background while the manager frees and reuses them for another request.

This PR currently supports only the LMCache connector. Support for the other connectors will be added in follow-up PRs.
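To make the allocation difference concrete, here is a back-of-the-envelope sketch. It is not vLLM's actual `SingleTypeKVCacheManager` logic and ignores block alignment, lookahead/draft tokens, and prefix-cache reuse:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return -(-a // b)


def blocks_needed(num_tokens: int, block_size: int, sliding_window: int | None) -> int:
    # Full-attention layers keep KV for every token; sliding-window layers only
    # need KV for (roughly) the last `sliding_window` tokens.
    if sliding_window is None:
        return cdiv(num_tokens, block_size)
    return cdiv(min(num_tokens, sliding_window), block_size)


# Example: a 32k-token prompt with 16-token blocks and a 512-token window.
print(blocks_needed(32_768, 16, None))  # 2048 blocks for a full-attention layer
print(blocks_needed(32_768, 16, 512))   # 32 blocks for a sliding-window layer
```

Before this PR, the sliding-window layers would transiently allocate the full block count and free the excess in the next scheduling step, which is exactly the window where the connector may still be copying those blocks.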
cc @KuntaiDu @heheda12345
Test Plan
The test script is a modification of the one in PR #25712.
The script should be run with LMCache-side support: LMCache/LMCache#1436.
Caution
Please apply the following patch to LMCache if you get import errors for `cdiv`:

vLLM patch

lmcache-patch

```diff
diff --git a/lmcache/integration/vllm/vllm_v1_adapter.py b/lmcache/integration/vllm/vllm_v1_adapter.py
index a849097..4db64df 100644
--- a/lmcache/integration/vllm/vllm_v1_adapter.py
+++ b/lmcache/integration/vllm/vllm_v1_adapter.py
@@ -18,7 +18,10 @@ from vllm.distributed.parallel_state import (
     get_tp_group,
 )
 from vllm.sampling_params import SamplingParams
-from vllm.utils import cdiv
+try:
+    from vllm.utils import cdiv
+except ImportError:
+    from vllm.utils.math_utils import cdiv
```

To run this script on H100, please save the following code into `test_connector_w_hybrid_kv_allocator.py`, and run `python test_connector_w_hybrid_kv_allocator.py`.

`test_connector_w_hybrid_kv_allocator.py`
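The actual script is in the collapsed section above; for orientation, a minimal sketch of what such a run can look like is below. The model name, LMCache environment variables, and prompt sizes are assumptions, not the script the PR uses:

```python
# Illustrative sketch only -- not the PR's attached test script.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache settings (assumed: local CPU offloading backend).
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "20"  # GB

llm = LLM(
    model="google/gemma-3-4b-it",  # interleaved sliding-window + full attention
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
    # The hybrid KV cache manager stays enabled (the feature under test).
    max_model_len=32768,
    gpu_memory_utilization=0.8,
)

long_prompt = "word " * 20000  # long enough to exceed the sliding window
outputs = llm.generate([long_prompt], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```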
Test Result
Previously, we could not allocate KV cache for the 3rd request, which tries to allocate long prefixes and load external KV cache even for sliding-window layers. With this PR, the 3rd request allocates only the KV cache needed for sliding-window layers, and can be scheduled and finish with correct results.
Detailed output
I also benchmarked the performance impact of this PR with the Gemma 3 series. There is barely any latency or throughput difference.
This branch
main