
Support Multiple KV-Cache Groups in Speculative Decoding Drafters #12

Closed

tomasruizt wants to merge 36 commits into main from feature/spec-decode-gemma3-2

Conversation

tomasruizt (Owner) commented Feb 4, 2026

Summary

This PR enables models with multiple KV-cache groups to be used as drafters in speculative decoding. Previously, the speculative decoding infrastructure assumed a single KV-cache group, which prevented the use of architectures like Gemma3 and GPT-OSS MoE models as drafters.

Key changes:

  • Refactored CommonAttentionMetadata handling to support a dictionary of metadata per KV-cache group ID (CommonAttnMetadataByGid)
  • Added per-group slot-mapping buffers for draft model inference
  • Introduced layer_names_to_kv_cache_gid mapping to correctly route attention layers to their corresponding KV-cache groups
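The routing described in the bullets above can be sketched as follows. This is an illustrative stand-in only: the names `CommonAttentionMetadata`, `CommonAttnMetadataByGid`, and `layer_names_to_kv_cache_gid` follow the PR's terminology, but the fields and helper shown here are assumptions, not vLLM's actual dataclasses.

```python
from dataclasses import dataclass


# Illustrative stand-in: the real CommonAttentionMetadata in vLLM
# carries many more fields than just the slot mapping.
@dataclass
class CommonAttentionMetadata:
    slot_mapping: list[int]  # per-group slot-mapping buffer


# One metadata object per KV-cache group id (the PR's CommonAttnMetadataByGid).
CommonAttnMetadataByGid = dict[int, CommonAttentionMetadata]


def metadata_for_layer(
    layer_name: str,
    layer_names_to_kv_cache_gid: dict[str, int],
    metadata_by_gid: CommonAttnMetadataByGid,
) -> CommonAttentionMetadata:
    """Route an attention layer to the metadata of its KV-cache group."""
    gid = layer_names_to_kv_cache_gid[layer_name]
    return metadata_by_gid[gid]


# Example: a Gemma3-style model where full-attention and sliding-window
# layers live in different KV-cache groups (layer names are hypothetical).
metadata_by_gid = {
    0: CommonAttentionMetadata(slot_mapping=[0, 1, 2]),    # full attention
    1: CommonAttentionMetadata(slot_mapping=[10, 11, 12]), # sliding window
}
layer_gids = {
    "model.layers.0.self_attn": 1,
    "model.layers.5.self_attn": 0,
}
meta = metadata_for_layer("model.layers.0.self_attn", layer_gids, metadata_by_gid)
```

Under a single-group assumption, the second dictionary lookup collapses away, which is why the previous code could pass one metadata object to every layer.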

Fixes vllm-project#33133

New Test Cases

Two new end-to-end test cases validate the feature:

  1. Gemma3 (270m): Tests a model architecture with multiple KV-cache groups (different head configurations across layers). Achieves 100% acceptance rate with VLLM_BATCH_INVARIANT=1.

  2. GPT-OSS MoE (120b/20b): Tests MoE layer resolution in speculative decoding with different target/draft model sizes. Initially, this combination exhibited low acceptance rates because the cold-start MoE optimization interfered with speculative decoding. This was resolved in vllm-project/vllm#33624 ("[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding"), which disables that optimization when speculative decoding is active. This test case ensures MoE models continue to work correctly with speculative decoding.
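For context on the acceptance rates cited above, here is a minimal sketch of how an acceptance rate can be computed from drafted vs. accepted token counts. This is illustrative only and is not the metrics code used by vLLM.

```python
def acceptance_rate(num_drafted: int, num_accepted: int) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    if num_drafted == 0:
        return 0.0
    return num_accepted / num_drafted


# With VLLM_BATCH_INVARIANT=1, draft and target runs of the same model
# produce bitwise-identical logits, so every drafted token is accepted.
full_acceptance = acceptance_rate(5, 5)    # 1.0
partial_acceptance = acceptance_rate(4, 2) # 0.5
```

A low value of this metric is what initially flagged the cold-start MoE optimization as interfering with drafting in the GPT-OSS case.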

Test Plan

  • Existing unit tests pass (tests/v1/spec_decode/test_eagle.py)
  • New e2e tests for Gemma3 and GPT-OSS pass (tests/v1/e2e/test_spec_decode.py)
  • Benchmarks are coming next and will provide performance numbers for:
    • Gemma3-27b-it with Gemma3-270m-it drafter
    • GPT-OSS-120b with GPT-OSS-20b drafter
    • Comparison against main branch to ensure no performance regressions in drafting code

Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
tomasruizt closed this Feb 4, 2026


Successfully merging this pull request may close issue: [Bug]: Using GPT OSS 20B as Drafter Throws Error