Fix DFlash first prefill lookahead allocation#41971

Open
Yuyi-Ao wants to merge 1 commit into vllm-project:main from Yuyi-Ao:yuyiao/verify-dflash-prefill-lookahead

Conversation

@Yuyi-Ao
Contributor

@Yuyi-Ao Yuyi-Ao commented May 7, 2026

Purpose

Fix DFlash first-prefill lookahead allocation.

DFlash needs draft KV slots during the first prefill step. The scheduler should therefore allocate lookahead slots/blocks for DFlash even when num_computed_tokens == 0.
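The allocation condition being fixed can be sketched as follows. This is an illustrative model only; the function and parameter names (`effective_lookahead_tokens`, `use_dflash`) are assumptions for clarity, not the exact vLLM scheduler code.

```python
def effective_lookahead_tokens(
    num_computed_tokens: int,
    num_lookahead_tokens: int,
    use_dflash: bool,
) -> int:
    """Sketch of the scheduler's lookahead-slot decision (hypothetical names)."""
    # Before the fix: no lookahead slots were reserved on the very first
    # prefill step (num_computed_tokens == 0).
    if num_computed_tokens == 0 and not use_dflash:
        return 0
    # DFlash runs its draft proposal in the same model-runner step as the
    # target prefill, so it needs draft KV slots from the first step onward.
    return num_lookahead_tokens
```

Under this sketch, a DFlash request gets its lookahead slots even at `num_computed_tokens == 0`, while the pre-fix behavior for other methods is unchanged.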

This PR also adds a focused test that connects scheduler allocation output to the real DFlash input expansion kernel and verifies the generated query slots are request-owned.

Test Plan

  • Run the new DFlash slot mapping regression test.
  • Compare behavior before and after the scheduler fix.

Test Result

.venv/bin/python -m pytest tests/v1/spec_decode/test_dflash_slot_mapping.py -v

All tests pass.

Signed-off-by: George-ao <1586028831@qq.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@Yuyi-Ao
Contributor Author

Yuyi-Ao commented May 7, 2026

My current understanding is that DFlash needs lookahead KV/slot allocation during first prefill because it runs draft proposal in the same model runner step as the target prefill. During the first DFlash draft proposal, DFlash already needs KV slots for draft-token positions after the prompt.
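To make the failure mode concrete, here is a minimal sketch of the slot arithmetic involved. The helper names (`draft_query_positions`, `blocks_needed`) are hypothetical and exist only to illustrate why first-prefill draft positions can land in an unallocated block.

```python
def draft_query_positions(prompt_len: int, num_spec_tokens: int) -> list[int]:
    """Token positions a first-prefill DFlash draft proposal writes KV for.

    These come immediately after the prompt, so the request must own
    blocks covering [prompt_len, prompt_len + num_spec_tokens).
    """
    return list(range(prompt_len, prompt_len + num_spec_tokens))


def blocks_needed(last_pos: int, block_size: int) -> int:
    """Number of logical blocks needed to cover positions [0, last_pos]."""
    return last_pos // block_size + 1
```

For example, with a 16-token block size, a 16-token prompt needs one block, but the first draft position (16) falls in a second block; without first-prefill lookahead allocation that block is not request-owned, which is exactly what the regression test checks.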

I added tests/v1/spec_decode/test_dflash_slot_mapping.py to capture the validation I used for this issue. It is more of a repro script than a long-term test. Before the scheduler change, it fails because the kernel maps those first-prefill DFlash query positions through a logical block that was not allocated for the request.

Please correct me if I got any part of the DFlash flow wrong.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces DFlash support in the V1 scheduler, enabling lookahead token allocation during the first prefill step. It also adds a new test suite, test_dflash_slot_mapping.py, to verify that DFlash query slots address request-owned blocks. Critical feedback was raised regarding the initialization of num_lookahead_tokens when DFlash is enabled, since omitting it would leave the effective lookahead token count at zero.

Comment on lines +223 to +224
if speculative_config.use_dflash():
    self.use_dflash = True
Contributor


critical

The num_lookahead_tokens is not initialized when use_dflash is true, which causes effective_lookahead_tokens to be 0 even when use_dflash is enabled. It should be set to self.num_spec_tokens to ensure lookahead slots are allocated.

Suggested change
if speculative_config.use_dflash():
    self.use_dflash = True
if speculative_config.use_dflash():
    self.use_dflash = True
    self.num_lookahead_tokens = self.num_spec_tokens

Contributor Author


num_lookahead_tokens is already initialized for DFlash because SpeculativeConfig.use_eagle() currently returns true for "dflash", and that branch sets self.num_lookahead_tokens = self.num_spec_tokens before the new use_dflash() branch runs. The new self.use_dflash flag is only used later to keep first-prefill lookahead enabled for DFlash.
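The initialization order described above can be sketched like this. This is a simplified model of the explanation, not the actual vLLM config class; the method-name checks stand in for `use_eagle()` / `use_dflash()`.

```python
class SpecConfigSketch:
    """Simplified stand-in for the config init order described above."""

    def __init__(self, method: str, num_spec_tokens: int):
        self.num_spec_tokens = num_spec_tokens
        self.num_lookahead_tokens = 0
        self.use_dflash = False
        # use_eagle() currently returns True for "dflash" as well, so this
        # earlier branch already sets num_lookahead_tokens for DFlash.
        if method in ("eagle", "dflash"):
            self.num_lookahead_tokens = self.num_spec_tokens
        # The new branch only records the flag; it is consumed later by the
        # scheduler to keep first-prefill lookahead enabled for DFlash.
        if method == "dflash":
            self.use_dflash = True
```

So by the time the `use_dflash()` branch runs, `num_lookahead_tokens` is already non-zero, which is why the suggested extra assignment is redundant under the current `use_eagle()` behavior.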
