Vllm add dflash + Optimize draft models (CUDA graph management) #36733

Closed
biggestCjb wants to merge 35 commits into vllm-project:main from biggestCjb:vllm-add-dflash

biggestCjb commented Mar 11, 2026

Purpose

This PR builds on the DFlash speculative decoding implementation in #32206. It optimizes the draft model side by managing CUDA graphs, which reduces overhead and improves end-to-end speculative decoding performance.

Test Plan

  • Platform: H800
  • Config: draft number = 4
  • Baseline: EAGLE3 with the same BS=1 and draft number=4
  • Workloads: math, dialogue, and chat tasks

Test Result

On H800 with BS=1 and draft number=4:

  • Math tasks: up to 2.0x speedup
  • Dialogue tasks: at least 1.4x speedup
  • vs EAGLE3 (BS=1, draft number=4): ~15% average speed improvement

vllm serve Qwen3-8B \
    --host 0.0.0.0 \
    --port 8898 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --max-model-len 20480 \
    --max-num-seqs 5 \
    --gpu-memory-utilization 0.9 \
    --speculative-config '{"method": "dflash", "model": "Qwen3-8B-DFlash", "num_speculative_tokens": 4, "disable_padded_drafter_batch": true}'
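The --speculative-config value in the serve command is a JSON string, which is easy to break with shell quoting. As a minimal sketch (the keys and values are copied from the command above; nothing here is part of the PR itself), the argument can be generated from a Python dict so the JSON stays valid:

```python
import json
import shlex

# Keys and values copied from the serve command above.
speculative_config = {
    "method": "dflash",
    "model": "Qwen3-8B-DFlash",
    "num_speculative_tokens": 4,
    "disable_padded_drafter_batch": True,
}

# json.dumps emits the lowercase JSON literal `true` for Python's True.
config_json = json.dumps(speculative_config)

# shlex.quote wraps the JSON so the shell passes it as a single argument.
flag = "--speculative-config " + shlex.quote(config_json)
print(flag)
```

The printed flag can be pasted into the command line in place of the hand-written one.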

vllm bench serve \
    --base-url http://0.0.0.0:8898 \
    --model Qwen3-8B \
    --tokenizer Qwen3-8B \
    --dataset-name hf \
    --dataset-path mt-bench \
    --hf-name philschmid/mt-bench \
    --max-concurrency 1 \
    --num-prompts 80 \
    --temperature 0

[Image: h800]

Under different concurrency conditions:

method   BS=1     BS=2     BS=3     BS=4
eagle3   222.77   413.02   597.85   739.80
dflash   307.11   568.01   804.87   962.37
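To compare the two rows directly, the per-batch-size ratio can be computed from the table. This is only a sketch: the source does not state the unit, so it assumes the numbers are a throughput-style metric where higher is better, and the helper name relative_gain is made up for illustration.

```python
# Values copied from the table above (unit not stated in the source).
results = {
    "eagle3": {1: 222.77, 2: 413.02, 3: 597.85, 4: 739.80},
    "dflash": {1: 307.11, 2: 568.01, 3: 804.87, 4: 962.37},
}


def relative_gain(candidate: dict[int, float], baseline: dict[int, float]) -> dict[int, float]:
    """Return candidate/baseline for every batch size in the baseline (hypothetical helper)."""
    return {bs: candidate[bs] / baseline[bs] for bs in baseline}


ratios = relative_gain(results["dflash"], results["eagle3"])
for bs, ratio in sorted(ratios.items()):
    print(f"BS={bs}: dflash/eagle3 = {ratio:.2f}x")
```

Under that assumption the ratio works out to roughly 1.3x to 1.4x across the four batch sizes, and it shrinks as the batch size grows.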

Signed-off-by: biggestCjb <167741811+biggestCjb@users.noreply.github.com>

mergify bot added the new-model, qwen, speculative-decoding, and v1 labels Mar 11, 2026

mergify bot commented Mar 11, 2026

Hi @biggestCjb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for the DFlash speculative decoding method. The changes are well-integrated into the existing vLLM architecture, touching upon configuration, model registration, and the core model runner logic. A new model implementation for DFlash with Qwen3 (qwen3_dflash.py) and the corresponding proposer logic (dflash.py) are added. The changes appear to be consistent and correct in enabling this new feature. My main feedback concerns incorrect type hints in the new qwen3_dflash.py file, which could be misleading for future development and static analysis tools.

Note: Security Review is unavailable for this PR.

positions: torch.Tensor,
hidden_states: torch.Tensor,
input_embeds: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:

critical

The return type hint for Qwen3Model.forward is tuple[torch.Tensor, torch.Tensor], but the implementation at line 346 returns a single torch.Tensor. This should be corrected to -> torch.Tensor: to match the implementation.

Suggested change
- ) -> tuple[torch.Tensor, torch.Tensor]:
+ ) -> torch.Tensor:

positions: torch.Tensor,
hidden_states: torch.Tensor,
inputs_embeds: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:

critical

The return type hint for DFlashQwen3ForCausalLM.forward is tuple[torch.Tensor, torch.Tensor]. However, it returns the result of self.model.forward(), which is a single tensor. This should be corrected to -> torch.Tensor: to match the actual return type and be a valid override of the base class method.

Suggested change
- ) -> tuple[torch.Tensor, torch.Tensor]:
+ ) -> torch.Tensor:
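The kind of mismatch flagged in both review comments can be reproduced without torch. The sketch below is illustrative only: it uses plain lists in place of torch.Tensor, and forward_wrong, forward_fixed, and annotation_matches are hypothetical names, not code from the PR.

```python
from typing import get_origin, get_type_hints


def forward_wrong(hidden_states: list[float]) -> tuple[list[float], list[float]]:
    # The annotation promises a 2-tuple, but a single list is returned.
    return [2.0 * x for x in hidden_states]


def forward_fixed(hidden_states: list[float]) -> list[float]:
    # The annotation now matches the single value that is returned.
    return [2.0 * x for x in hidden_states]


def annotation_matches(func, *args) -> bool:
    """Check the runtime container type of the result against the declared return type."""
    declared = get_type_hints(func)["return"]
    # tuple[...] -> tuple, list[...] -> list; unparameterized types pass through.
    expected = get_origin(declared) or declared
    return isinstance(func(*args), expected)


print(annotation_matches(forward_wrong, [1.0, 2.0]))
print(annotation_matches(forward_fixed, [1.0, 2.0]))
```

Python does not enforce return annotations at runtime, so the mismatched version still runs; it is static checkers such as mypy and readers of the signature that are misled.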

Remove extra whitespace in function call.

Refactor DFlash code for improved readability and maintainability.

Refactor DFlash model forward method for clarity.

biggestCjb changed the title from "Vllm add dflash" to "Vllm add dflash + Optimize draft models (CUDA graph management)" Mar 12, 2026
mergify bot added the nvidia label Mar 12, 2026
benchislett (Collaborator) commented

Closing as this PR is in a rougher state than both #36847 and #36767.

github-project-automation bot moved this to Done in NVIDIA Mar 12, 2026