
[Spec Decoding] Integrate DFlash into speculative decoding pipeline#1869

Open
aaronzhfeng wants to merge 1 commit into vllm-project:main from aaronzhfeng:pr_dflash_1b

Conversation

@aaronzhfeng

Description

Wire DFlash block-diffusion speculative decoding into the existing TPU inference pipeline. The DFlash model and proposer were added in #1868; this PR connects them to the runner, KV cache manager, and speculative decoding manager so DFlash can be used end-to-end.

No changes to existing Eagle3 or ngram code paths: DFlash gets its own propose_dflash_draft_token_ids method and a separate elif "dflash" dispatch branch.

Modified files:

  • tpu_inference/models/common/model_loader.py -- register DFlashDraftModel in model registry
  • tpu_inference/models/jax/qwen3.py -- collect aux_hidden_states from target layers during forward pass (needed by DFlash proposer to inject target context)
  • tpu_inference/runner/tpu_runner.py -- add DFlashProposer initialization for method="dflash"
  • tpu_inference/runner/speculative_decoding_manager.py -- add dflash method dispatch and propose_dflash_draft_token_ids (uses accepted_attn_metadata with correct seq_lens for drafter)
  • tpu_inference/runner/kv_cache_manager.py -- extend draft KV cache allocation to cover dflash, read num_hidden_layers from config instead of hardcoding 1
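
The separate dispatch branch might look roughly like this sketch. Only the method name `propose_dflash_draft_token_ids` and the `"dflash"` key come from this PR; the class shape, argument names, and other method names here are illustrative stand-ins:

```python
# Hypothetical sketch of the dflash dispatch; everything except "dflash" and
# propose_dflash_draft_token_ids is illustrative, not the actual vLLM code.
class SpeculativeDecodingManager:
    def __init__(self, method: str):
        self.method = method

    def propose_draft_token_ids(self, aux_hidden_states, accepted_attn_metadata):
        if self.method == "eagle3":
            return self._propose_eagle3(aux_hidden_states)
        elif self.method == "ngram":
            return self._propose_ngram(aux_hidden_states)
        elif self.method == "dflash":
            # New, separate branch: the Eagle3/ngram paths above are untouched.
            return self.propose_dflash_draft_token_ids(
                aux_hidden_states, accepted_attn_metadata)
        raise ValueError(f"unknown speculative method: {self.method}")

    def propose_dflash_draft_token_ids(self, aux_hidden_states, accepted_attn_metadata):
        # Placeholder body: the real method runs the DFlash drafter using
        # accepted_attn_metadata, whose seq_lens reflect accepted tokens.
        return [[101, 102, 103]]
```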

Usage (after both #1868 and this PR):

```python
args['speculative_config'] = {
    'model': 'z-lab/Qwen3-4B-DFlash-b16',
    'num_speculative_tokens': 5,
    'method': 'dflash',
    'draft_tensor_parallel_size': 1,
}
```

Tests

E2e tests are in a follow-up PR.

Checklist

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

Signed-off-by: aaronzhfeng <fzx333578@gmail.com>
Collaborator

@kyuyeunk kyuyeunk left a comment


I think the PR looks okay, but please add a unit test.

It doesn't have to be a big one -- like integrating this into a CI: #1870

Just a simple check -- e.g. making sure that the functions added in tpu_inference/runner/speculative_decoding_manager.py (like propose_dflash_draft_token_ids) are working correctly -- would give me better confidence that this won't break anything.
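
A minimal test in that spirit could fake the drafter so no TPU is needed and just verify that the method forwards the post-acceptance attention metadata. The manager and proposer classes below are simplified stand-ins for the real ones, not the actual implementation:

```python
# Illustrative unit test; FakeDFlashProposer and the simplified manager are
# stand-ins for the real classes in speculative_decoding_manager.py.
class FakeDFlashProposer:
    def propose(self, aux_hidden_states, attn_metadata):
        # Record what we were called with so the test can assert on it.
        self.seen_metadata = attn_metadata
        return [[11, 12, 13]]  # fixed fake draft token ids

class SpecDecodingManager:
    def __init__(self, proposer):
        self.proposer = proposer

    def propose_dflash_draft_token_ids(self, aux_hidden_states, accepted_attn_metadata):
        # The drafter must see seq_lens that reflect accepted tokens,
        # not the pre-acceptance metadata.
        return self.proposer.propose(aux_hidden_states, accepted_attn_metadata)

def test_propose_dflash_draft_token_ids():
    proposer = FakeDFlashProposer()
    mgr = SpecDecodingManager(proposer)
    drafts = mgr.propose_dflash_draft_token_ids(
        aux_hidden_states=None,
        accepted_attn_metadata={"seq_lens": [7]},
    )
    assert drafts == [[11, 12, 13]]
    assert proposer.seen_metadata["seq_lens"] == [7]
```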

@Lumosis
Collaborator

Lumosis commented Apr 1, 2026

We should precompile the jitted functions for dflash in compilation_manager.py
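
Something along these lines, using JAX's ahead-of-time `lower`/`compile` API; `dflash_draft_step` and the shapes here are stand-ins for the actual jitted drafter functions:

```python
import jax
import jax.numpy as jnp

@jax.jit
def dflash_draft_step(x):  # stand-in for the real jitted DFlash step
    return x * 2

# Ahead-of-time compile for the shapes/dtypes the runner will actually use,
# so the first real request doesn't pay tracing + compilation cost.
dummy = jnp.zeros((4, 8), jnp.float32)
compiled_step = dflash_draft_step.lower(dummy).compile()
```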
