Multiple Hybrid KV Cache Coordinator #30263
Closed
roikoren755 wants to merge 4 commits into vllm-project:main from
Conversation
Signed-off-by: Roi Koren <roik@nvidia.com>
2817c9e to 030eeae
Hi @roikoren755, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Roi Koren <roik@nvidia.com>
Hi @roikoren755, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Collaborator
Per offline discussion, we will continue on this PR when more details are available.
Member
Closing as superseded by #31707
Purpose
The current `HybridKVCacheCoordinator` limits the KV cache specs to exactly two types, with one being full attention and the second non-full attention (sliding window, mamba, ...). Some of the models released by NVIDIA have more than two types (with at least one full attention), like vGQA (different number of KV heads per layer).
Currently, we must disable prefix caching when running these models in vLLM. This PR enables automatic prefix caching for such models by updating the `HybridKVCacheCoordinator` so that it accepts at least one full attention KV cache spec plus at least one other spec, which may also be full attention but must differ from the first. As it does today, the coordinator finds the shortest cache hit and returns it, but now across multiple "other" KV cache specs. We also do not force an order between the different KV cache specs (for example, all full attention specs grouped up front).
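The following is a minimal sketch of the idea only, not the actual vLLM implementation; the names used here (`KVCacheGroup`, `find_prefix_hit_blocks`, `longest_common_cache_hit`, `BLOCK_SIZE`) are hypothetical. It illustrates how a prefix cache hit can be computed over an arbitrary number of KV cache spec groups by taking the minimum hit length across all groups, with no assumption about group count or order.

```python
# Hypothetical sketch: shortest common prefix hit across any number of KV cache spec groups.
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)


@dataclass
class KVCacheGroup:
    spec_id: str                      # e.g. "full_attention/8_kv_heads", "sliding_window/1024", "mamba"
    cached_block_hashes: set[str]     # hashes of blocks already present in this group's cache


def find_prefix_hit_blocks(group: KVCacheGroup, request_block_hashes: list[str]) -> int:
    """Count how many leading blocks of the request are cached in this group."""
    hits = 0
    for block_hash in request_block_hashes:
        if block_hash not in group.cached_block_hashes:
            break
        hits += 1
    return hits


def longest_common_cache_hit(groups: list[KVCacheGroup], request_block_hashes: list[str]) -> int:
    """A prefix is reusable only if every group has it cached, so the overall hit
    length is the minimum over all groups, regardless of how many groups exist
    or in which order they appear."""
    if not groups:
        return 0
    return min(find_prefix_hit_blocks(g, request_block_hashes) for g in groups) * BLOCK_SIZE
```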
Test Plan
Current tests pass. Add new tests and parametrize existing ones to accept more than two different KV cache specs (a toy illustration of such parametrization is sketched below).
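The sketch below is hypothetical and uses a toy stand-in rather than the real vLLM test fixtures or coordinator; it only shows the shape of parametrizing a test over two, three, or more spec combinations, including the vGQA-style case where several groups are all full attention.

```python
# Hypothetical test parametrization sketch; ToyCoordinator is a made-up stand-in.
import pytest


class ToyCoordinator:
    """Toy stand-in: the overall hit length is the minimum over per-spec hit lengths."""

    def __init__(self, per_spec_hits: dict[str, int]):
        self.per_spec_hits = per_spec_hits

    def find_longest_cache_hit(self) -> int:
        return min(self.per_spec_hits.values())


@pytest.mark.parametrize(
    "per_spec_hits, expected",
    [
        ({"full_attention": 8, "sliding_window": 4}, 4),                  # two specs (supported today)
        ({"full_attention": 8, "sliding_window": 4, "mamba": 2}, 2),      # three specs
        ({"full_attention/8_heads": 6, "full_attention/4_heads": 5}, 5),  # vGQA-style, both full attention
    ],
)
def test_hit_is_minimum_across_specs(per_spec_hits, expected):
    assert ToyCoordinator(per_spec_hits).find_longest_cache_hit() == expected
```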
Test Result
All tests pass
Performance:
Running the server on an H100 GPU with `vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 --trust-remote-code`, I got the following results.
GSM8K was run with `lm_eval --model local-completions --tasks gsm8k --model_args "model=nvidia/NVIDIA-Nemotron-Nano-9B-v2,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=4,max_retries=3,tokenized_requests=False" --cache_requests true --batch_size 256`.
E2E with `vllm bench serve --backend vllm --model nvidia/NVIDIA-Nemotron-Nano-9B-v2 --endpoint /v1/completions --dataset-name random --num-prompts 10`.
Branch:
GSM8K:
E2E:
Main:
GSM8K:
E2E:
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.