[Model] Introduce first-class DFlash speculative decoding in vLLM V1 #34014

Closed
dangoldbj wants to merge 18 commits into vllm-project:main from dangoldbj:dflash-1

dangoldbj (Contributor) commented Feb 6, 2026

Purpose

This PR introduces first-class DFlash speculative decoding support in vLLM V1.

vLLM did not previously support DFlash speculative decoding, so this is the initial implementation. It establishes an explicit, config-driven DFlash execution path, with validation and tests aimed at correctness and maintainability.
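As a rough illustration of what "config-driven" could mean here, the sketch below validates a speculative config up front and selects the execution path from it. The class, field, and function names are hypothetical and are not taken from this PR or from vLLM's actual config classes; only the method name "dflash" and the two runtime mode names come from the description.

```python
from dataclasses import dataclass

# Runtime mode names as listed in the PR description.
RUNTIME_MODES = ("shared_eagle", "block_drafting")


@dataclass(frozen=True)
class DFlashSpecConfig:
    """Hypothetical config object; not vLLM's real SpeculativeConfig."""

    method: str
    runtime_mode: str
    num_speculative_tokens: int

    def __post_init__(self) -> None:
        # Fail at construction time instead of deep inside a decode step.
        if self.method != "dflash":
            raise ValueError(f"expected method='dflash', got {self.method!r}")
        if self.runtime_mode not in RUNTIME_MODES:
            raise ValueError(
                f"unknown runtime mode {self.runtime_mode!r}; "
                f"expected one of {RUNTIME_MODES}"
            )
        if self.num_speculative_tokens < 1:
            raise ValueError("num_speculative_tokens must be >= 1")


def select_path(cfg: DFlashSpecConfig) -> str:
    """Choose the execution path from the config alone (no hardcoded default)."""
    return f"{cfg.runtime_mode}_path"


cfg = DFlashSpecConfig(
    method="dflash", runtime_mode="block_drafting", num_speculative_tokens=4
)
print(select_path(cfg))
```

Validating in the constructor keeps an unsupported combination from reaching the proposer at all, which is one plausible reading of replacing hardcoded runtime assumptions with config-driven behavior.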

Key changes:

  • Adds explicit method="dflash" support and full DFlash model/config plumbing.
  • Establishes a dedicated DFlashProposer execution path (separate from EAGLE), including:
    • shared_eagle runtime mode.
    • block_drafting runtime mode.
  • Hardens DFlash correctness for both BS=1 and BS>1 workloads via:
    • per-sequence metadata handling and slot-mapping logic
    • shape and invariant validation
    • backend capability checks for non-causal drafting
  • Removes hardcoded DFlash runtime assumptions in favor of config-driven behavior.
  • Expands DFlash coverage across unit and end-to-end speculative decoding tests.
  • Adds runtime hardening to reduce per-step allocation overhead in DFlash block drafting.
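The per-sequence slot-mapping and shape-validation items above can be sketched as follows. This is a minimal illustration, assuming the usual paged KV-cache layout in which a slot id is the physical block id times the block size plus the in-block offset. The helper name and its arguments are hypothetical and are not the PR's code.

```python
def block_slot_mapping(block_tables, start_positions, num_draft_tokens, block_size):
    """Return one list of KV-cache slot ids per sequence.

    block_tables[i][j] is the physical block id backing logical block j of
    sequence i. Each sequence drafts `num_draft_tokens` consecutive positions
    starting at its own start position, so batch size > 1 is handled per
    sequence rather than with a single shared offset.
    """
    # Shape and invariant validation before any indexing.
    if len(block_tables) != len(start_positions):
        raise ValueError("one start position is required per sequence")
    if block_size <= 0 or num_draft_tokens <= 0:
        raise ValueError("block_size and num_draft_tokens must be positive")

    slots = []
    for seq_idx, (table, start) in enumerate(zip(block_tables, start_positions)):
        seq_slots = []
        for pos in range(start, start + num_draft_tokens):
            logical_block, offset = divmod(pos, block_size)
            if logical_block >= len(table):
                raise ValueError(
                    f"sequence {seq_idx}: position {pos} has no allocated block"
                )
            seq_slots.append(table[logical_block] * block_size + offset)
        slots.append(seq_slots)
    return slots


# Two sequences at different lengths drafting a block of 3 tokens each.
mapping = block_slot_mapping(
    block_tables=[[7, 2], [5, 9]],
    start_positions=[3, 5],
    num_draft_tokens=3,
    block_size=4,
)
print(mapping)  # [[31, 8, 9], [37, 38, 39]]
```

The first sequence crosses a block boundary (position 3 lands in physical block 7, positions 4 and 5 in physical block 2), which is the kind of case a single batch-wide offset would get wrong.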

Test Plan

Testing focuses on DFlash proposer correctness, backend compatibility, and speculative decoding invariants across batch sizes.

  • Run DFlash proposer and unit tests.
  • Run DFlash backend guard tests in the GPU model runner.
  • Run targeted speculative decoding end-to-end coverage updates in CI.

Commands run locally:

  • python -m pytest -q tests/v1/spec_decode/test_dflash.py
  • python -m pytest -q tests/v1/worker/test_gpu_model_runner.py -k "dflash_metadata_builder or raise_if_unsupported_dflash_backend"

Test Result

  • tests/v1/spec_decode/test_dflash.py: 22 passed
  • tests/v1/worker/test_gpu_model_runner.py -k "dflash_metadata_builder or raise_if_unsupported_dflash_backend": 4 passed
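The second test filter refers to a backend guard for DFlash. The PR page does not show that guard's body, so the following is only a sketch of how a capability check for non-causal drafting might be structured and tested. `BackendInfo`, `ensure_non_causal_support`, and the backend names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendInfo:
    """Hypothetical description of an attention backend's capabilities."""

    name: str
    supports_non_causal: bool


def ensure_non_causal_support(backend: BackendInfo, needs_non_causal: bool) -> None:
    """Raise early if the drafter needs non-causal attention the backend lacks."""
    if needs_non_causal and not backend.supports_non_causal:
        raise ValueError(
            f"attention backend {backend.name!r} does not support "
            "non-causal attention, which this drafting mode requires"
        )


def guard_rejects(backend: BackendInfo, needs_non_causal: bool) -> bool:
    """Small test helper: True if the guard raises for this combination."""
    try:
        ensure_non_causal_support(backend, needs_non_causal)
    except ValueError:
        return True
    return False


causal_only = BackendInfo(name="causal_only_backend", supports_non_causal=False)
flexible = BackendInfo(name="flexible_backend", supports_non_causal=True)

assert guard_rejects(causal_only, needs_non_causal=True)
assert not guard_rejects(causal_only, needs_non_causal=False)
assert not guard_rejects(flexible, needs_non_causal=True)
```

Running the check once at model-runner setup, rather than per step, would surface an incompatible backend as a clear error instead of silently wrong draft tokens.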

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Commits (18), each with Signed-off-by: dangoldbj <dangoldbj23@gmail.com>. Commit titles are truncated in the timeline; the visible fragments are:

  • … from Eagle
  • …n proposer tests
  • …erage
  • … regression tests
  • …or metadata snapshot/restore

mergify bot commented Feb 6, 2026

Documentation preview: https://vllm--34014.org.readthedocs.build/en/34014/

mergify bot added the documentation, new-model, qwen, speculative-decoding, and v1 labels Feb 6, 2026

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces first-class support for DFlash speculative decoding, a significant new feature for vLLM. The changes are comprehensive, including the core model and proposer logic, configuration updates for auto-detection and validation, and integration into the model runner. The implementation is robust, with two distinct runtime modes (shared_eagle and block_drafting) and careful state management, especially in the complex block_drafting path. The addition of extensive unit, integration, and end-to-end tests, including for batched requests, demonstrates a thorough approach to ensuring correctness. The code is well-structured and adheres to the project's existing patterns. Overall, this is an excellent contribution that is ready for merging.

@benchislett (Collaborator)

Please strive for a simple implementation. We should be able to reuse a lot of code with the existing runtime and not duplicate so much of the speculative decoding logic.

Also, #32887 implements base support for parallel drafting. It should be very simple to extend this to support DFlash. 500+ lines of new proposer code seems like far too much new code to support DFlash logic. Closing for now.

@benchislett benchislett closed this Feb 6, 2026
@dangoldbj (Contributor, Author)

@benchislett Thanks for reviewing this.
