[Core][KV Transfer] Support PD disagg + speculative decoding KV lifecycle #34926

Closed
ZhanqiuHu wants to merge 3 commits into vllm-project:main from ZhanqiuHu:pd-sd-delay-clear

@ZhanqiuHu ZhanqiuHu commented Feb 19, 2026

Summary

Defer KV connector finalization for P/D disaggregation + speculative decoding compatibility.
Add E2E tests for MTP (DeepSeek), EAGLE (Llama-3.1-8B), and EAGLE3 (GPT-OSS-20B) with ExampleConnector.

Purpose

The connector metadata was being cleared (and wait_for_save called) after the target model forward but before the draft model forward, so the drafter's KV could not be saved or loaded correctly. This PR defers both steps until after the draft forward completes.
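
To illustrate the ordering problem, here is a minimal sketch using a toy connector. `ToyConnector` and `run_step` are hypothetical names made up for this example and are not vLLM's actual classes; the toy also assumes that a save without bound metadata is silently skipped, which simplifies the real behavior.

```python
class ToyConnector:
    """Hypothetical stand-in for a KV connector; not vLLM's real class."""

    def __init__(self):
        self.metadata = None
        self.saved = []

    def bind_connector_metadata(self, metadata):
        self.metadata = metadata

    def clear_connector_metadata(self):
        self.metadata = None

    def save_kv_layer(self, layer_name):
        # Assumption for this toy: a layer is only saved while metadata is bound.
        if self.metadata is not None:
            self.saved.append(layer_name)


def run_step(connector, clear_before_draft):
    """Run one toy step: target forward, optional early clear, draft forward."""
    connector.bind_connector_metadata({"req": "r0"})
    for layer in ("target.0", "target.1"):  # target model forward
        connector.save_kv_layer(layer)
    if clear_before_draft:  # old ordering: finalize before the drafter runs
        connector.clear_connector_metadata()
    connector.save_kv_layer("draft.0")  # draft model forward
    connector.clear_connector_metadata()  # final clear in both orderings
    return connector.saved


early = run_step(ToyConnector(), clear_before_draft=True)
deferred = run_step(ToyConnector(), clear_before_draft=False)
print(early)     # ['target.0', 'target.1']
print(deferred)  # ['target.0', 'target.1', 'draft.0']
```

With the early clear, the drafter layer never reaches the connector; with the deferred clear, it is saved alongside the target layers.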

Test Plan

Test Result



When spec decode is active, the draft model's forward pass also needs
the connector metadata to save its KV cache via @maybe_transfer_kv_layer.
Add delay_clear param to _get_kv_connector_output so the finally block
skips clear_connector_metadata(), and explicitly clear after draft
proposals complete in sample_tokens().

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
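
A rough sketch of the delay_clear idea described in this commit, assuming a context-manager shape with a finally block. `kv_connector_scope` and `ToyConnector` are hypothetical names for illustration only; the real `_get_kv_connector_output` has a different signature and does more work.

```python
from contextlib import contextmanager


class ToyConnector:
    """Hypothetical stand-in for a KV connector; not vLLM's real class."""

    def __init__(self):
        self.metadata = None

    def bind_connector_metadata(self, metadata):
        self.metadata = metadata

    def clear_connector_metadata(self):
        self.metadata = None


@contextmanager
def kv_connector_scope(connector, metadata, delay_clear=False):
    """Bind metadata for the forward pass; optionally leave it bound on exit."""
    connector.bind_connector_metadata(metadata)
    try:
        yield connector
    finally:
        # With delay_clear=True the caller takes over responsibility for clearing.
        if not delay_clear:
            connector.clear_connector_metadata()


conn = ToyConnector()

# Default path: metadata is cleared as soon as the scope exits.
with kv_connector_scope(conn, {"req": "r0"}):
    pass
cleared_immediately = conn.metadata is None

# Spec-decode path: metadata survives the scope so the drafter can use it.
with kv_connector_scope(conn, {"req": "r1"}, delay_clear=True):
    pass
still_bound = conn.metadata is not None

# ... the draft forward would run here ...
conn.clear_connector_metadata()  # explicit clear after draft proposals
cleared_after_draft = conn.metadata is None
```

The trade-off is that the caller must remember the explicit clear; forgetting it would leave stale metadata bound for the next step.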

When speculative decoding is enabled with KV transfer, the connector
metadata was being cleared after the target model's forward pass but
before the draft model's forward. This prevented drafter KV layers
from being saved/loaded.

Fix: defer both wait_for_save() and clear_connector_metadata() until
after the draft model forward completes via finalize_connector_and_clear().

Add E2E tests covering MTP, EAGLE, and EAGLE3 with ExampleConnector.

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
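
A small sketch of why wait_for_save() has to move together with the clear, assuming saves are queued and only made durable once wait_for_save() is called. `ToyAsyncConnector` is a hypothetical stand-in, and the body of `finalize_connector_and_clear` here is an assumed shape, not the PR's actual implementation.

```python
class ToyAsyncConnector:
    """Hypothetical connector that queues saves until wait_for_save()."""

    def __init__(self):
        self.metadata = None
        self.pending = []
        self.persisted = []

    def bind_connector_metadata(self, metadata):
        self.metadata = metadata

    def clear_connector_metadata(self):
        self.metadata = None

    def save_kv_layer(self, layer_name):
        if self.metadata is None:
            raise RuntimeError("save_kv_layer called without bound metadata")
        self.pending.append(layer_name)  # queued, not yet durable

    def wait_for_save(self):
        # Flush everything queued so far.
        self.persisted.extend(self.pending)
        self.pending.clear()


def finalize_connector_and_clear(connector):
    """Assumed shape: flush pending saves first, then drop the metadata."""
    connector.wait_for_save()
    connector.clear_connector_metadata()


conn = ToyAsyncConnector()
conn.bind_connector_metadata({"req": "r0"})
conn.save_kv_layer("target.0")  # target model forward
conn.save_kv_layer("draft.0")   # draft model forward
finalize_connector_and_clear(conn)  # runs only after both forwards

print(conn.persisted)  # ['target.0', 'draft.0']
```

If the flush ran before the draft forward, `draft.0` would still be sitting in the pending queue when the step ended.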

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly defers the KV connector finalization to support prefill/decode disaggregation in combination with speculative decoding. The changes are well-implemented by introducing a delay_clear flag in the KVConnectorModelRunnerMixin, which is enabled when a speculative configuration is present. The deferred finalization is then correctly triggered after the draft model's forward pass. The addition of comprehensive end-to-end tests for MTP, EAGLE, and EAGLE3 methods is excellent and ensures the new functionality is robust and correct. The overall implementation is clean and effectively addresses the issue of saving drafter KV caches in a disaggregated setup.

GPT-OSS uses hybrid sliding/full attention which is incompatible with
ExampleConnector's single block table slot mapping. Tests are skipped
for now; GPT-OSS works with NixlConnector (raw block transfer).

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>

mergify bot commented Feb 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ZhanqiuHu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@ZhanqiuHu (Contributor Author)

Follow-up in #35158

@ZhanqiuHu ZhanqiuHu closed this Mar 5, 2026