Fix DFlash first prefill lookahead allocation #41971
Signed-off-by: George-ao <1586028831@qq.com>
My current understanding is that DFlash needs lookahead KV/slot allocation during the first prefill because it runs the draft proposal in the same model runner step as the target prefill. During that first draft proposal, DFlash already needs KV slots for the draft-token positions after the prompt. I added [...] Please correct me if I got any part of the DFlash flow wrong.
Code Review
This pull request introduces DFlash support in the V1 scheduler, enabling lookahead token allocation during the first prefill step. It also adds a new test suite, test_dflash_slot_mapping.py, to verify that DFlash query slots address request-owned blocks. One critical review comment concerns the initialization of num_lookahead_tokens when DFlash is enabled: if it were omitted, the effective number of lookahead tokens would be zero.
    if speculative_config.use_dflash():
        self.use_dflash = True
The num_lookahead_tokens is not initialized when use_dflash is true, which causes effective_lookahead_tokens to be 0 even when use_dflash is enabled. It should be set to self.num_spec_tokens to ensure lookahead slots are allocated.
Suggested change
    - if speculative_config.use_dflash():
    -     self.use_dflash = True
    + if speculative_config.use_dflash():
    +     self.use_dflash = True
    +     self.num_lookahead_tokens = self.num_spec_tokens
num_lookahead_tokens is already initialized for DFlash because SpeculativeConfig.use_eagle() currently returns true for "dflash", and that branch sets self.num_lookahead_tokens = self.num_spec_tokens before the new use_dflash() branch runs. The new self.use_dflash flag is only used later to keep first-prefill lookahead enabled for DFlash.
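The branch ordering described in this reply can be illustrated with a small stand-alone sketch. It is not the vLLM source: `FakeSpeculativeConfig` and `FakeScheduler` are hypothetical stand-ins, and the set of method names accepted by `use_eagle()` is an assumption made only so the example runs. It shows why the `use_dflash()` branch does not need to set `num_lookahead_tokens` again.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeSpeculativeConfig:
    """Hypothetical stand-in for SpeculativeConfig (illustration only)."""
    method: str
    num_speculative_tokens: int

    def use_eagle(self) -> bool:
        # Assumption: per the reply, "dflash" is currently treated as an
        # EAGLE-style method. The other names are placeholders.
        return self.method in ("eagle", "eagle3", "dflash")

    def use_dflash(self) -> bool:
        return self.method == "dflash"


class FakeScheduler:
    """Hypothetical stand-in for the scheduler's __init__ logic."""

    def __init__(self, speculative_config: Optional[FakeSpeculativeConfig]):
        self.num_spec_tokens = 0
        self.num_lookahead_tokens = 0
        self.use_dflash = False
        if speculative_config is None:
            return
        self.num_spec_tokens = speculative_config.num_speculative_tokens
        if speculative_config.use_eagle():
            # Runs first, and already covers "dflash".
            self.num_lookahead_tokens = self.num_spec_tokens
        if speculative_config.use_dflash():
            # Only records the flag; the lookahead count is already set.
            self.use_dflash = True


sched = FakeScheduler(FakeSpeculativeConfig(method="dflash", num_speculative_tokens=4))
print(sched.num_lookahead_tokens, sched.use_dflash)
```

If `use_eagle()` ever stopped returning true for "dflash", the explicit assignment from the suggested change would become necessary, so the two branches are coupled through that assumption.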
Purpose
Fix DFlash first-prefill lookahead allocation.
DFlash needs draft KV slots during the first prefill step. The scheduler should therefore allocate lookahead slots/blocks for DFlash even when num_computed_tokens == 0.
This PR also adds a focused test that connects scheduler allocation output to the real DFlash input expansion kernel and verifies the generated query slots are request-owned.
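As a rough sketch of the two ideas in this section, the snippet below models the first-prefill lookahead decision and a slot-ownership check. Both helpers are hypothetical and are not taken from the PR. The first assumes that, before this change, lookahead was skipped whenever num_computed_tokens == 0. The second assumes the common paged-KV convention that a slot index equals block_id * block_size + offset.

```python
from typing import Iterable


def effective_lookahead_tokens(
    num_computed_tokens: int,
    num_lookahead_tokens: int,
    use_dflash: bool,
) -> int:
    """Hypothetical helper: lookahead tokens to allocate for this step."""
    first_prefill = num_computed_tokens == 0
    if first_prefill and not use_dflash:
        # Assumed prior behaviour: no lookahead on the first prefill.
        return 0
    # DFlash drafts in the same step as the target prefill, so it keeps
    # its lookahead allocation even on the first prefill.
    return num_lookahead_tokens


def slots_are_request_owned(
    slot_mapping: Iterable[int],
    owned_block_ids: Iterable[int],
    block_size: int,
) -> bool:
    """Hypothetical check: every slot falls inside a block owned by the request."""
    owned = set(owned_block_ids)
    return all(slot // block_size in owned for slot in slot_mapping)


print(effective_lookahead_tokens(0, 4, use_dflash=True))
print(slots_are_request_owned([32, 33, 50], owned_block_ids=[2, 3], block_size=16))
```

A slot that maps to a block outside the request's allocation, for example slot 64 with the same blocks and block size, would make the check return False, which is the kind of failure the new test is meant to catch.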
Test Plan
Test Result
pass