[Model] Introduce first-class DFlash speculative decoding in vLLM V1#34014
dangoldbj wants to merge 18 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--34014.org.readthedocs.build/en/34014/
Code Review
This pull request introduces first-class support for DFlash speculative decoding, a significant new feature for vLLM. The changes are comprehensive, including the core model and proposer logic, configuration updates for auto-detection and validation, and integration into the model runner. The implementation is robust, with two distinct runtime modes (shared_eagle and block_drafting) and careful state management, especially in the complex block_drafting path. The addition of extensive unit, integration, and end-to-end tests, including for batched requests, demonstrates a thorough approach to ensuring correctness. The code is well-structured and adheres to the project's existing patterns. Overall, this is an excellent contribution that is ready for merging.
Please strive for a simple implementation. We should be able to reuse a lot of code with the existing runtime and not duplicate so much of the speculative decoding logic. Also, #32887 implements a base support for parallel drafting. It should be very simple to extend this to support DFlash. 500+ lines of new proposer code seems like far too much new code to support DFlash logic. Closing for now.
@benchislett Thanks for reviewing this.
Purpose
This PR introduces first-class DFlash speculative decoding support in vLLM V1.
This is the initial DFlash implementation in vLLM; prior to this change, vLLM did not support DFlash speculative decoding. This PR establishes an explicit, config-driven DFlash execution path with correctness and maintainability guarantees.
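As a rough illustration of the config-driven path, enabling DFlash would presumably follow vLLM's existing `speculative_config` convention. In the sketch below, only `method="dflash"` comes from this PR; the other field names, the draft model name, and the values are assumptions for illustration:

```python
# Hypothetical sketch of enabling DFlash speculative decoding in vLLM V1.
# Only method="dflash" is taken from this PR; the remaining fields mirror
# vLLM's existing speculative_config conventions and may differ here.
speculative_config = {
    "method": "dflash",            # selects the DFlash proposer path
    "model": "org/dflash-draft",   # hypothetical draft model checkpoint
    "num_speculative_tokens": 4,   # draft tokens proposed per step
}

# Typical usage would then pass this dict to the engine, e.g.:
# from vllm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
#           speculative_config=speculative_config)
# outputs = llm.generate(["Hello, my name is"])
```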
Key changes:
- `method="dflash"` support and full DFlash model/config plumbing.
- `DFlashProposer` execution path (separate from EAGLE), including:
  - `shared_eagle` runtime mode.
  - `block_drafting` runtime mode.

Test Plan
Testing focuses on DFlash proposer correctness, backend compatibility, and speculative decoding invariants across batch sizes.
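One invariant of the kind described above can be sketched in isolation: whatever the batch size, a proposer must return one draft per request and never more draft tokens per request than the configured limit. The `StubProposer` below is an invented stand-in for illustration, not the actual `DFlashProposer` from this PR:

```python
# Hypothetical sketch of a speculative-decoding invariant check. StubProposer
# is a stand-in that emits variable-length drafts; the real DFlashProposer
# in this PR is far more involved.
import random

NUM_SPECULATIVE_TOKENS = 4  # assumed draft-length limit

class StubProposer:
    def propose(self, batch_size: int) -> list[list[int]]:
        # One draft per request, each capped at NUM_SPECULATIVE_TOKENS tokens.
        return [
            [random.randint(0, 31999)
             for _ in range(random.randint(0, NUM_SPECULATIVE_TOKENS))]
            for _ in range(batch_size)
        ]

def check_draft_length_invariant(proposer: StubProposer) -> bool:
    # Exercise several batch sizes, as the PR's batched tests do.
    for batch_size in (1, 2, 8, 64):
        drafts = proposer.propose(batch_size)
        assert len(drafts) == batch_size
        assert all(len(d) <= NUM_SPECULATIVE_TOKENS for d in drafts)
    return True
```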
Commands run locally:
python -m pytest -q tests/v1/spec_decode/test_dflash.py
python -m pytest -q tests/v1/worker/test_gpu_model_runner.py -k "dflash_metadata_builder or raise_if_unsupported_dflash_backend"

Test Result
tests/v1/spec_decode/test_dflash.py: 22 passed
tests/v1/worker/test_gpu_model_runner.py -k "dflash_metadata_builder or raise_if_unsupported_dflash_backend": 4 passed

Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.