Support Multiple KV-Cache Groups in Speculative Decoding Drafters#12
Closed · tomasruizt wants to merge 36 commits into main
Summary
This PR enables models with multiple KV-cache groups to be used as drafters in speculative decoding. Previously, the speculative decoding infrastructure assumed a single KV-cache group, which prevented the use of architectures like Gemma3 and GPT-OSS MoE models as drafters.
Key changes:
- `CommonAttentionMetadata` handling now supports a dictionary of metadata per KV-cache group ID (`CommonAttnMetadataByGid`), sketched below
- A `layer_names_to_kv_cache_gid` mapping correctly routes attention layers to their corresponding KV-cache groups

Fixes vllm-project#33133
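For illustration, here is a minimal Python sketch of the shape of this change. Only `CommonAttentionMetadata`, `CommonAttnMetadataByGid`, and `layer_names_to_kv_cache_gid` are names from this PR; the simplified fields, layer names, and the `metadata_for_layer` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CommonAttentionMetadata:
    """Simplified stand-in for vLLM's CommonAttentionMetadata."""
    num_reqs: int
    max_query_len: int


# One metadata object per KV-cache group ID -- the "CommonAttnMetadataByGid"
# shape described above.
CommonAttnMetadataByGid = dict[int, CommonAttentionMetadata]

# Hypothetical mapping from attention-layer name to its KV-cache group.
layer_names_to_kv_cache_gid: dict[str, int] = {
    "model.layers.0.self_attn": 0,
    "model.layers.1.self_attn": 1,
}


def metadata_for_layer(
    layer_name: str, by_gid: CommonAttnMetadataByGid
) -> CommonAttentionMetadata:
    """Route a layer to the metadata of its KV-cache group (hypothetical helper)."""
    return by_gid[layer_names_to_kv_cache_gid[layer_name]]


if __name__ == "__main__":
    by_gid = {
        0: CommonAttentionMetadata(num_reqs=4, max_query_len=8),
        1: CommonAttentionMetadata(num_reqs=4, max_query_len=8),
    }
    print(metadata_for_layer("model.layers.0.self_attn", by_gid))
```

Keying the metadata by group ID lets a drafter whose layers have different head configurations build attention metadata once per group, rather than assuming a single group shared by every layer.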
New Test Cases
Two new end-to-end test cases validate the feature:
1. **Gemma3 (270m)**: Tests a model architecture with multiple KV-cache groups (different head configurations across layers). Achieves a 100% acceptance rate with `VLLM_BATCH_INVARIANT=1`; see the configuration sketch after this list.
2. **GPT-OSS MoE (120b/20b)**: Tests MoE layer resolution in speculative decoding with different target/draft model sizes. Initially, this combination exhibited low acceptance rates because the cold-start MoE optimization interfered with speculative decoding. This was resolved in vllm-project/vllm#33624 ("[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding"), which disables that optimization when speculative decoding is active. This test case ensures MoE models continue to work correctly with speculative decoding.
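For reference, a minimal sketch of how the Gemma3 case could be driven offline. The checkpoint name, `speculative_config` keys, and speculative token count are assumptions and not necessarily what the test uses.

```python
import os

from vllm import LLM, SamplingParams

# Batch-invariant kernels make draft and target computations match exactly,
# which is what enables the 100% acceptance rate mentioned above.
os.environ["VLLM_BATCH_INVARIANT"] = "1"

llm = LLM(
    model="google/gemma-3-270m-it",  # assumed checkpoint name
    speculative_config={
        "model": "google/gemma-3-270m-it",  # draft model (assumed)
        "num_speculative_tokens": 3,        # assumed value
    },
)
out = llm.generate(["The capital of France is"], SamplingParams(temperature=0))
print(out[0].outputs[0].text)
```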
Test Plan
- Unit tests (`tests/v1/spec_decode/test_eagle.py`)
- End-to-end tests (`tests/v1/e2e/test_spec_decode.py`)
- Benchmarked against the `main` branch to ensure no performance regressions in drafting code