Vllm add dflash + Optimize draft models (CUDA graph management) by biggestCjb · Pull Request #36733 · vllm-project/vllm

biggestCjb · 2026-03-11T02:51:19Z

Purpose

This PR builds on the DFlash speculative decoding implementation in #32206. It optimizes the draft-model side by managing CUDA graphs to reduce overhead and improve end-to-end speculative decoding performance.
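
The description does not show how the draft model's CUDA graphs are managed. As a rough illustration of one common pattern, and not this PR's implementation, a runner can capture one graph per batch-size bucket ahead of time and pad each incoming batch up to the nearest captured size, so that it replays a graph instead of launching kernels eagerly. The names `CAPTURE_SIZES` and `pick_graph_size` are hypothetical.

```python
from bisect import bisect_left
from typing import Optional, Sequence

# Hypothetical batch sizes for which a graph was captured ahead of time.
CAPTURE_SIZES = (1, 2, 4, 8)


def pick_graph_size(
    num_tokens: int, capture_sizes: Sequence[int] = CAPTURE_SIZES
) -> Optional[int]:
    """Return the smallest captured size that fits num_tokens, or None.

    None means no captured graph is large enough, so the caller would
    fall back to eager execution. capture_sizes must be sorted ascending.
    """
    idx = bisect_left(capture_sizes, num_tokens)
    return capture_sizes[idx] if idx < len(capture_sizes) else None


for n in (1, 3, 5, 9):
    size = pick_graph_size(n)
    mode = f"replay graph padded to {size}" if size is not None else "eager fallback"
    print(f"{n} tokens -> {mode}")
```

Padding wastes a little compute on the unused slots, which is usually an acceptable trade for skipping per-kernel launch overhead at small batch sizes.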

Test Plan

Platform: H800
Config: draft number = 4
Baseline: EAGLE3 with the same BS=1 and draft number=4
Workloads: math, dialogue, and chat tasks

Test Result

On H800 with BS=1 and draft number=4:

Math tasks: up to 2.0x speedup
Dialogue tasks: at least 1.4x speedup
vs EAGLE3 (BS=1, draft number=4): ~15% average speed improvement

vllm serve Qwen3-8B \
    --host 0.0.0.0 \
    --port 8898 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --max-model-len 20480 \
    --max-num-seqs 5 \
    --gpu-memory-utilization 0.9 \
    --speculative-config '{"method": "dflash", "model": "Qwen3-8B-DFlash", "num_speculative_tokens": 4, "disable_padded_drafter_batch": true}'
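
The `--speculative-config` value is a JSON object passed as a single shell argument, so quoting mistakes are easy to make. A minimal sketch, assuming the server is launched from a Python script: build the dictionary, serialize it with `json.dumps`, and pass it as one argv element. The keys and values are copied from the command above; the `build_serve_command` helper is hypothetical and shows only a subset of the flags.

```python
import json
import shlex

# Values copied from the serve command in this PR description.
spec_config = {
    "method": "dflash",
    "model": "Qwen3-8B-DFlash",
    "num_speculative_tokens": 4,
    "disable_padded_drafter_batch": True,
}


def build_serve_command(model: str, port: int, spec: dict) -> list:
    """Hypothetical helper: assemble argv with the JSON config as one element."""
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--speculative-config", json.dumps(spec),
    ]


cmd = build_serve_command("Qwen3-8B", 8898, spec_config)
# shlex.join quotes the JSON so the printed line can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` directly, which avoids shell quoting altogether.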

vllm bench serve \
    --base-url http://0.0.0.0:8898 \
    --model Qwen3-8B \
    --tokenizer Qwen3-8B \
    --dataset-name hf \
    --dataset-path mt-bench \
    --hf-name philschmid/mt-bench \
    --max-concurrency 1 \
    --num-prompts 80 \
    --temperature 0

Under different concurrency conditions:

method	BS=1	BS=2	BS=3	BS=4
eagle3	222.77	413.02	597.85	739.80
dflash	307.11	568.01	804.87	962.37
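
The relative gain of dflash over eagle3 at each concurrency level follows directly from the table. The sketch below computes the ratios from the reported numbers. The PR does not state the unit of these figures, so the code treats them as unitless values and assumes higher is better, which the speedup claims in this description suggest.

```python
# Numbers copied from the table above, keyed by batch size (BS).
eagle3 = {1: 222.77, 2: 413.02, 3: 597.85, 4: 739.80}
dflash = {1: 307.11, 2: 568.01, 3: 804.87, 4: 962.37}

# Ratio of dflash to eagle3 at each batch size.
ratios = {bs: dflash[bs] / eagle3[bs] for bs in eagle3}

for bs, ratio in ratios.items():
    print(f"BS={bs}: dflash/eagle3 = {ratio:.2f}x")
```

On these numbers the ratio falls from about 1.38x at BS=1 to about 1.30x at BS=4.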

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

github-actions · 2026-03-11T02:51:27Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

mergify · 2026-03-11T02:54:59Z

Hi @biggestCjb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

gemini-code-assist

Code Review

This pull request introduces support for the DFlash speculative decoding method. The changes are well-integrated into the existing vLLM architecture, touching upon configuration, model registration, and the core model runner logic. A new model implementation for DFlash with Qwen3 (qwen3_dflash.py) and the corresponding proposer logic (dflash.py) are added. The changes appear to be consistent and correct in enabling this new feature. My main feedback concerns incorrect type hints in the new qwen3_dflash.py file, which could be misleading for future development and static analysis tools.

Note: Security Review is unavailable for this PR.

gemini-code-assist · 2026-03-11T03:03:01Z

vllm/model_executor/models/qwen3_dflash.py

+        positions: torch.Tensor,
+        hidden_states: torch.Tensor,
+        input_embeds: torch.Tensor | None = None,
+    ) -> tuple[torch.Tensor, torch.Tensor]:


The return type hint for Qwen3Model.forward is tuple[torch.Tensor, torch.Tensor], but the implementation at line 346 returns a single torch.Tensor. This should be corrected to -> torch.Tensor: to match the implementation.

Suggested change

-    ) -> tuple[torch.Tensor, torch.Tensor]:
+    ) -> torch.Tensor:
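
A mismatch like the one flagged here is what a static checker such as mypy reports. A minimal sketch of the corrected shape, using a stand-in `Tensor` class so that it runs without PyTorch; `TinyModel` is hypothetical and is not the PR's `Qwen3Model`.

```python
from typing import get_type_hints


class Tensor:
    """Stand-in for torch.Tensor so the sketch has no dependencies."""


class TinyModel:
    def forward(self, positions: Tensor, hidden_states: Tensor) -> Tensor:
        # A single tensor is returned, so the annotation is a single Tensor
        # rather than tuple[Tensor, Tensor].
        return hidden_states


hints = get_type_hints(TinyModel.forward)
out = TinyModel().forward(Tensor(), Tensor())

# The declared return type and the runtime value now agree.
print(hints["return"] is Tensor, isinstance(out, hints["return"]))
```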

gemini-code-assist · 2026-03-11T03:03:01Z

vllm/model_executor/models/qwen3_dflash.py

+        positions: torch.Tensor,
+        hidden_states: torch.Tensor,
+        inputs_embeds: torch.Tensor | None = None,
+    ) -> tuple[torch.Tensor, torch.Tensor]:


The return type hint for DFlashQwen3ForCausalLM.forward is tuple[torch.Tensor, torch.Tensor]. However, it returns the result of self.model.forward(), which is a single tensor. This should be corrected to -> torch.Tensor: to match the actual return type and be a valid override of the base class method.

Suggested change

-    ) -> tuple[torch.Tensor, torch.Tensor]:
+    ) -> torch.Tensor:

benchislett · 2026-03-12T04:42:30Z

Closing as this PR is in a rougher state than both #36847 and #36767

biggestCjb added 6 commits March 11, 2026 10:40

Add files via upload

912def0

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Add files via upload

c7a30ac

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Add files via upload

10d3ce3

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Add files via upload

4588307

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Add files via upload

2f0cd48

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Add files via upload

294e111

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

biggestCjb requested review from MatthewBonanni, ProExpertProg, WoosukKwon, benchislett, hmellor, houseroad, luccafong, mgoin, njhill, robertgshaw2-redhat, sighingnow, tlrmchlsmth, yewentao256 and youkaichao as code owners March 11, 2026 02:51

mergify bot added the new-model, qwen, speculative-decoding, and v1 labels Mar 11, 2026

gemini-code-assist bot reviewed Mar 11, 2026

Refactor EagleModelTypes for better readability

7bdc78d

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Update dflash.py

0e29b25

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

biggestCjb added 3 commits March 11, 2026 14:56

Refactor import statements and clean up code

b49d85c

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Fix formatting of RuntimeError message

ece1552

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Refactor assertions for drafter type checks

fdad71a

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Fix formatting issue in speculative.py

46884ec

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Refactor DFlash metadata and input handling

068e04c

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Update drafter type assertions to include DFlashModelProposer

aec7223

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

biggestCjb added 2 commits March 11, 2026 15:36

Refactor assertion for drafter type checking

ffaab45

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Merge branch 'main' into vllm-add-dflash

a71b219

biggestCjb added 5 commits March 11, 2026 21:48

Merge branch 'vllm-project:main' into vllm-add-dflash

13194ee

Fix formatting of uses_aux_hidden_states assignment

e70a633

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Clean up whitespace in dflash.py

5b738c0

Remove extra whitespace in function call. Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Fix assertion for drafter type check

a4fe5a1

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Fix assertion for drafter type in gpu_model_runner

a93bea1

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Refactor DFlash functions and remove comments

7ae122b

Refactor DFlash code for improved readability and maintainability. Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

Refactor input handling in DFlash model

43f315d

Refactor DFlash model forward method for clarity. Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

biggestCjb changed the title ~~Vllm add dflash~~ Vllm add dflash + Optimize draft models (CUDA graph management) Mar 12, 2026

mergify bot added the nvidia label Mar 12, 2026

github-project-automation bot added this to NVIDIA Mar 12, 2026

benchislett closed this Mar 12, 2026

github-project-automation bot moved this to Done in NVIDIA Mar 12, 2026

2 participants
