[Model] Introduce first-class DFlash speculative decoding in vLLM V1#34014
dangoldbj wants to merge 18 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--34014.org.readthedocs.build/en/34014/
Code Review
This pull request introduces first-class support for DFlash speculative decoding, a significant new feature for vLLM. The changes are comprehensive, including the core model and proposer logic, configuration updates for auto-detection and validation, and integration into the model runner. The implementation is robust, with two distinct runtime modes (shared_eagle and block_drafting) and careful state management, especially in the complex block_drafting path. The addition of extensive unit, integration, and end-to-end tests, including for batched requests, demonstrates a thorough approach to ensuring correctness. The code is well-structured and adheres to the project's existing patterns. Overall, this is an excellent contribution that is ready for merging.
Please strive for a simple implementation. We should be able to reuse a lot of code with the existing runtime and not duplicate so much of the speculative decoding logic. Also, #32887 implements a base support for parallel drafting. It should be very simple to extend this to support DFlash. 500+ lines of new proposer code seems like far too much new code to support DFlash logic. Closing for now.
@benchislett Thanks for reviewing this.
Purpose
This PR introduces first-class DFlash speculative decoding support in vLLM V1.
This is the initial DFlash implementation in vLLM; prior to this change, vLLM did not support DFlash speculative decoding. This PR establishes an explicit, config-driven DFlash execution path with correctness and maintainability guarantees.
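As a rough illustration of the config-driven path, enabling DFlash would presumably follow vLLM's existing `speculative_config` convention. In the sketch below, only `method="dflash"` comes from this PR; the other field names, the draft model name, and the values are assumptions for illustration:

```python
# Hypothetical sketch of enabling DFlash speculative decoding in vLLM V1.
# Only method="dflash" is taken from this PR; the remaining fields mirror
# vLLM's existing speculative_config conventions and may differ here.
speculative_config = {
    "method": "dflash",            # selects the DFlash proposer path
    "model": "org/dflash-draft",   # hypothetical draft model checkpoint
    "num_speculative_tokens": 4,   # draft tokens proposed per step
}

# Typical usage would then pass this dict to the engine, e.g.:
# from vllm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
#           speculative_config=speculative_config)
# outputs = llm.generate(["Hello, my name is"])
```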
Key changes:
- `method="dflash"` support and full DFlash model/config plumbing.
- `DFlashProposer` execution path (separate from EAGLE), including:
  - `shared_eagle` runtime mode.
  - `block_drafting` runtime mode.

Test Plan
Testing focuses on DFlash proposer correctness, backend compatibility, and speculative decoding invariants across batch sizes.
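One invariant of the kind described above can be sketched in isolation: whatever the batch size, a proposer must return one draft per request and never more draft tokens per request than the configured limit. The `StubProposer` below is an invented stand-in for illustration, not the actual `DFlashProposer` from this PR:

```python
# Hypothetical sketch of a speculative-decoding invariant check. StubProposer
# is a stand-in that emits variable-length drafts; the real DFlashProposer
# in this PR is far more involved.
import random

NUM_SPECULATIVE_TOKENS = 4  # assumed draft-length limit

class StubProposer:
    def propose(self, batch_size: int) -> list[list[int]]:
        # One draft per request, each capped at NUM_SPECULATIVE_TOKENS tokens.
        return [
            [random.randint(0, 31999)
             for _ in range(random.randint(0, NUM_SPECULATIVE_TOKENS))]
            for _ in range(batch_size)
        ]

def check_draft_length_invariant(proposer: StubProposer) -> bool:
    # Exercise several batch sizes, as the PR's batched tests do.
    for batch_size in (1, 2, 8, 64):
        drafts = proposer.propose(batch_size)
        assert len(drafts) == batch_size
        assert all(len(d) <= NUM_SPECULATIVE_TOKENS for d in drafts)
    return True
```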
Commands run locally:
python -m pytest -q tests/v1/spec_decode/test_dflash.py
python -m pytest -q tests/v1/worker/test_gpu_model_runner.py -k "dflash_metadata_builder or raise_if_unsupported_dflash_backend"

Test Result
tests/v1/spec_decode/test_dflash.py: 22 passed
tests/v1/worker/test_gpu_model_runner.py -k "dflash_metadata_builder or raise_if_unsupported_dflash_backend": 4 passed

Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.