Vllm add dflash + Optimize draft models (CUDA graph management) #36733
biggestCjb wants to merge 35 commits into vllm-project:main
Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>
Hi @biggestCjb, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
Code Review
This pull request introduces support for the DFlash speculative decoding method. The changes are well-integrated into the existing vLLM architecture, touching upon configuration, model registration, and the core model runner logic. A new model implementation for DFlash with Qwen3 (qwen3_dflash.py) and the corresponding proposer logic (dflash.py) are added. The changes appear to be consistent and correct in enabling this new feature. My main feedback concerns incorrect type hints in the new qwen3_dflash.py file, which could be misleading for future development and static analysis tools.
Note: Security Review is unavailable for this PR.
    positions: torch.Tensor,
    hidden_states: torch.Tensor,
    input_embeds: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    positions: torch.Tensor,
    hidden_states: torch.Tensor,
    inputs_embeds: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
The return type hint for DFlashQwen3ForCausalLM.forward is tuple[torch.Tensor, torch.Tensor]. However, it returns the result of self.model.forward(), which is a single tensor. This should be corrected to -> torch.Tensor: to match the actual return type and be a valid override of the base class method.
Suggested change:
- ) -> tuple[torch.Tensor, torch.Tensor]:
+ ) -> torch.Tensor:
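The mismatch described in this comment can be illustrated with a small sketch that runs without PyTorch. The `Tensor` class is a hypothetical stand-in for `torch.Tensor`, and `InnerModel`, `DraftForCausalLM`, and `return_matches_annotation` are made-up names for illustration; none of this is the PR's actual code.

```python
from typing import get_type_hints


class Tensor:
    """Hypothetical stand-in for torch.Tensor so the sketch needs no PyTorch."""


class InnerModel:
    def forward(self, positions: Tensor, hidden_states: Tensor) -> Tensor:
        # Returns a single tensor, like the inner model described in the review.
        return Tensor()


class DraftForCausalLM:
    def __init__(self) -> None:
        self.model = InnerModel()

    def forward(self, positions: Tensor, hidden_states: Tensor) -> Tensor:
        # The wrapper only delegates, so its annotation should also be a
        # single Tensor rather than tuple[Tensor, Tensor].
        return self.model.forward(positions, hidden_states)


def return_matches_annotation(obj, method_name: str, *args) -> bool:
    """Call a method and compare the result with its annotated return type.

    Only handles plain classes as annotations (isinstance cannot check
    parameterized generics such as tuple[Tensor, Tensor]).
    """
    method = getattr(obj, method_name)
    expected = get_type_hints(method)["return"]
    return isinstance(method(*args), expected)


ok = return_matches_annotation(DraftForCausalLM(), "forward", Tensor(), Tensor())
print(ok)
```

In practice a static checker such as mypy would flag the original annotation without running anything; the runtime check here is only meant to make the point concrete.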
Remove extra whitespace in function call. Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>
Refactor DFlash code for improved readability and maintainability. Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>
Refactor DFlash model forward method for clarity. Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>
Purpose
This PR builds on the DFlash speculative decoding implementation in #32206. It optimizes the draft model side by managing CUDA graphs for the draft model, with the goal of reducing overhead and improving end-to-end speculative decoding performance.
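The description does not say how the CUDA graphs are managed. One common pattern, shown here purely as an assumption, is to capture graphs for a fixed set of batch sizes and pad each incoming batch up to the nearest captured size, falling back to eager execution when the batch is too large. The sketch below models only that lookup logic in plain Python; `DraftGraphCache`, `run_draft`, and the capture sizes are hypothetical, and no GPU or real graph capture is involved.

```python
import bisect
from typing import Callable, Dict, List, Optional, Tuple

DraftFn = Callable[[List[int]], List[int]]


class DraftGraphCache:
    """Hypothetical cache mapping captured batch sizes to replayable callables."""

    def __init__(self, capture_sizes: List[int]) -> None:
        self.capture_sizes = sorted(set(capture_sizes))
        self.graphs: Dict[int, DraftFn] = {}

    def capture(self, run_draft: DraftFn) -> None:
        # Stand-in for capturing one CUDA graph per batch size.
        for size in self.capture_sizes:
            self.graphs[size] = run_draft

    def padded_size(self, batch_size: int) -> Optional[int]:
        # Smallest captured size that can hold the batch, or None if too large.
        idx = bisect.bisect_left(self.capture_sizes, batch_size)
        return self.capture_sizes[idx] if idx < len(self.capture_sizes) else None

    def run(self, tokens: List[int], run_eager: DraftFn) -> Tuple[List[int], str]:
        size = self.padded_size(len(tokens))
        if size is None or size not in self.graphs:
            return run_eager(tokens), "eager"
        # Pad to the captured size, replay, then drop the padded outputs.
        padded = tokens + [0] * (size - len(tokens))
        return self.graphs[size](padded)[: len(tokens)], f"graph[{size}]"


def run_draft(tokens: List[int]) -> List[int]:
    # Toy draft step: the real model would propose draft tokens here.
    return [t + 1 for t in tokens]


cache = DraftGraphCache(capture_sizes=[1, 2, 4, 8])
cache.capture(run_draft)
small = cache.run([5, 6, 7], run_eager=run_draft)
large = cache.run(list(range(9)), run_eager=run_draft)
print(small, large[1])
```

Padding trades a little wasted compute for the ability to replay a pre-captured graph, which avoids per-kernel launch overhead on small batches.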
Test Plan
Test Result
On H800 with BS=1 and draft number=4:
Under different concurrency conditions: