[Core][KV Transfer] Support PD disagg + speculative decoding KV lifecycle #34926
ZhanqiuHu wants to merge 3 commits into vllm-project:main from
When spec decode is active, the draft model's forward pass also needs the connector metadata to save its KV cache via @maybe_transfer_kv_layer. Add delay_clear param to _get_kv_connector_output so the finally block skips clear_connector_metadata(), and explicitly clear after draft proposals complete in sample_tokens(). Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
When speculative decoding is enabled with KV transfer, the connector metadata was being cleared after the target model's forward pass but before the draft model's forward. This prevented drafter KV layers from being saved/loaded. Fix: defer both wait_for_save() and clear_connector_metadata() until after the draft model forward completes via finalize_connector_and_clear(). Add E2E tests covering MTP, EAGLE, and EAGLE3 with ExampleConnector. Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
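The ordering problem and the deferred finalization described in the two commit messages above can be illustrated with a small toy model. This is only a sketch, not vLLM's actual code: `ToyConnector`, `kv_connector_step`, and `run_step` are hypothetical stand-ins, and only the names `delay_clear`, `wait_for_save`, `clear_connector_metadata`, and `finalize_connector_and_clear` are taken from the commit messages. It follows the second commit, where both `wait_for_save()` and `clear_connector_metadata()` are deferred.

```python
from contextlib import contextmanager


class ToyConnector:
    """Hypothetical stand-in for a KV connector (not vLLM's real class)."""

    def __init__(self):
        self.metadata = None
        self.saved_layers = []
        self.events = []

    def bind_connector_metadata(self, metadata):
        self.metadata = metadata
        self.events.append("bind")

    def save_kv_layer(self, layer_name):
        # A layer can only be saved while the step's metadata is still bound.
        if self.metadata is None:
            raise RuntimeError(f"no metadata bound while saving {layer_name}")
        self.saved_layers.append(layer_name)

    def wait_for_save(self):
        self.events.append("wait_for_save")

    def clear_connector_metadata(self):
        self.metadata = None
        self.events.append("clear")


def finalize_connector_and_clear(connector):
    # Finalize in one place: wait for pending saves, then drop the metadata.
    connector.wait_for_save()
    connector.clear_connector_metadata()


@contextmanager
def kv_connector_step(connector, metadata, delay_clear=False):
    """Bind metadata for one forward pass; optionally skip finalization."""
    connector.bind_connector_metadata(metadata)
    try:
        yield connector
    finally:
        if not delay_clear:
            finalize_connector_and_clear(connector)


def run_step(connector, use_spec_decode):
    # Target model forward: finalization is delayed only when a drafter follows.
    with kv_connector_step(connector, {"req": 0}, delay_clear=use_spec_decode):
        connector.save_kv_layer("target.layer0")
    if use_spec_decode:
        try:
            # Draft model forward still sees the bound metadata.
            connector.save_kv_layer("draft.layer0")
        finally:
            finalize_connector_and_clear(connector)
```

With `use_spec_decode=True`, the draft layer is saved before the connector is finalized. Calling `save_kv_layer` for the drafter after a non-delayed step raises instead, which mirrors the failure mode the PR describes.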
Code Review
This pull request correctly defers the KV connector finalization to support prefill/decode disaggregation in combination with speculative decoding. The changes are well-implemented by introducing a delay_clear flag in the KVConnectorModelRunnerMixin, which is enabled when a speculative configuration is present. The deferred finalization is then correctly triggered after the draft model's forward pass. The addition of comprehensive end-to-end tests for MTP, EAGLE, and EAGLE3 methods is excellent and ensures the new functionality is robust and correct. The overall implementation is clean and effectively addresses the issue of saving drafter KV caches in a disaggregated setup.
GPT-OSS uses hybrid sliding/full attention which is incompatible with ExampleConnector's single block table slot mapping. Tests are skipped for now; GPT-OSS works with NixlConnector (raw block transfer). Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
This pull request has merge conflicts that must be resolved before it can be merged.
Follow-up in #35158
Summary
Defer KV connector finalization for P/D disaggregation + speculative decoding compatibility.
Add E2E tests for MTP (DeepSeek), EAGLE (Llama-3.1-8B), and EAGLE3 (GPT-OSS-20B) with ExampleConnector.
Purpose
The connector metadata (and wait_for_save) was being finalized after the target model forward but before the draft model forward, preventing drafter KV from being saved/loaded correctly. This PR defers both until after the draft forward completes.
Test Plan
Test Result