
Multiple Hybrid KV Cache Coordinator#30263

Closed
roikoren755 wants to merge 4 commits into vllm-project:main from roikoren755:feat/multiple_kv_cache_coordinator

Conversation

Contributor

@roikoren755 roikoren755 commented Dec 8, 2025

Purpose

The current HybridKVCacheCoordinator limits the KV cache specs to exactly two types, with one being full attention and the other non-full attention (sliding window, mamba, ...).

Some of the models released by NVIDIA have more than two types (with at least one being full attention), like vGQA (a different number of KV heads per layer).

Currently, we must disable prefix caching when running these models in vLLM. This PR enables automatic prefix caching for such models by updating the HybridKVCacheCoordinator to accept at least one full attention KV cache spec plus at least one other spec, which may also be full attention as long as it is a different spec. As today, the coordinator finds the shortest cache hit and returns it, but now across multiple "other" KV cache specs. We also no longer force an order between the different KV cache specs (for example, all full attention specs grouped up front).
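The "shortest cache hit across groups" idea above can be sketched as follows. This is an illustrative stand-alone function, not vLLM's actual implementation; the name `find_hybrid_cache_hit` and its signature are hypothetical.

```python
def find_hybrid_cache_hit(hit_lengths_per_group: list[int], block_size: int) -> int:
    """Return the number of prefix tokens usable as a cache hit when every
    KV cache spec group (full attention, sliding window, mamba, ...) must
    agree on the cached prefix.

    hit_lengths_per_group: longest cache hit (in tokens) found independently
    for each KV cache spec group.
    """
    if not hit_lengths_per_group:
        return 0
    # The usable hit is bounded by the group with the shortest hit,
    # rounded down to a whole number of cache blocks.
    shortest = min(hit_lengths_per_group)
    return (shortest // block_size) * block_size
```

With three groups reporting hits of 48, 33, and 64 tokens and a block size of 16, only 32 tokens (two full blocks) can be reused, since the shortest group hit caps the shared prefix.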

Test Plan

Current tests pass. Added new tests and parametrized existing ones to accept more than two different KV cache specs.
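A test plan like the one above might be parametrized roughly as follows. This is only an illustrative sketch; the spec names and the invariant checked here are hypothetical stand-ins, not the PR's actual test code.

```python
import pytest

# Hypothetical spec-type combinations covering the cases the PR describes.
SPEC_COMBOS = [
    ["full_attention", "sliding_window"],                 # original two-type case
    ["full_attention", "sliding_window", "mamba"],        # more than two types
    ["full_attention", "full_attention_other", "mamba"],  # two distinct full-attention specs
]


@pytest.mark.parametrize("specs", SPEC_COMBOS)
def test_coordinator_accepts_specs(specs):
    # A real test would construct a HybridKVCacheCoordinator from these specs;
    # here we only check the invariants the PR relies on: at least one full
    # attention spec, and at least two distinct specs overall.
    assert any(s.startswith("full_attention") for s in specs)
    assert len(set(specs)) >= 2
```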

Test Result

All tests pass

Performance:

Running the server on an H100 GPU with vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 --trust-remote-code, I got the following results.
GSM8K was run with lm_eval --model local-completions --tasks gsm8k --model_args "model=nvidia/NVIDIA-Nemotron-Nano-9B-v2,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=4,max_retries=3,tokenized_requests=False" --cache_requests true --batch_size 256.
E2E with vllm bench serve --backend vllm --model nvidia/NVIDIA-Nemotron-Nano-9B-v2 --endpoint /v1/completions --dataset-name random --num-prompts 10.

Branch:
GSM8K:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6922|±  |0.0127|
|     |       |strict-match    |     5|exact_match|↑  |0.7293|±  |0.0122|

E2E:

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  1.52
Total input tokens:                      10230
Total generated tokens:                  1280
Request throughput (req/s):              6.58
Output token throughput (tok/s):         842.86
Peak output token throughput (tok/s):    749.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          7579.13
---------------Time to First Token----------------
Mean TTFT (ms):                          241.87
Median TTFT (ms):                        256.37
P99 TTFT (ms):                           290.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.98
Median TPOT (ms):                        9.87
P99 TPOT (ms):                           11.33
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.98
Median ITL (ms):                         9.65
P99 ITL (ms):                            10.11
==================================================

Main:
GSM8K:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6869|±  |0.0128|
|     |       |strict-match    |     5|exact_match|↑  |0.7255|±  |0.0123|

E2E:

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  1.51
Total input tokens:                      10230
Total generated tokens:                  1280
Request throughput (req/s):              6.62
Output token throughput (tok/s):         847.47
Peak output token throughput (tok/s):    749.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          7620.57
---------------Time to First Token----------------
Mean TTFT (ms):                          241.51
Median TTFT (ms):                        256.27
P99 TTFT (ms):                           290.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.92
Median TPOT (ms):                        9.81
P99 TPOT (ms):                           11.27
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.92
Median ITL (ms):                         9.60
P99 ITL (ms):                            10.03
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Roi Koren <roik@nvidia.com>
Signed-off-by: Roi Koren <roik@nvidia.com>
@roikoren755 force-pushed the feat/multiple_kv_cache_coordinator branch from 2817c9e to 030eeae on December 8, 2025, 14:17
@mergify

mergify bot commented Dec 8, 2025

Hi @roikoren755, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Roi Koren <roik@nvidia.com>
@mergify

mergify bot commented Dec 8, 2025

Hi @roikoren755, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Roi Koren <roik@nvidia.com>
@heheda12345 heheda12345 self-assigned this Dec 9, 2025
@heheda12345
Collaborator

Per offline discussion, we will continue on this PR when more details are available.

@DarkLight1337
Member

Closing as superseded by #31707

@roikoren755 roikoren755 deleted the feat/multiple_kv_cache_coordinator branch January 11, 2026 08:54