[feat] add draft_model spec_decode #4003
base: main
Conversation
Signed-off-by: 01267596 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces support for speculative decoding using a draft model. The implementation involves adding a DraftModelProposer and refactoring the existing EagleProposer into a base class to share common logic. The changes are extensive, touching attention utilities, scheduling configuration, and the model runner.
My review has identified a critical bug in the shared _propose method that will cause a crash when using the new draft_model method due to incorrect handling of hidden states. I have also found a high-severity issue related to logging that could impact production environments. I recommend addressing these issues to ensure the feature is robust and maintainable.
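For orientation, here is a minimal sketch of the class layout the review describes. The names SpecDecodeBaseProposer, EagleProposer, DraftModelProposer, and _propose come from the review text; the signatures and docstrings are assumptions, not the PR's actual code.

# Hypothetical sketch of the refactor described above; illustrative only.
class SpecDecodeBaseProposer:
    """Drafting logic shared by all speculative-decoding proposers."""

    def _propose(self, target_token_ids, next_token_ids,
                 target_hidden_states=None):
        # Shared drafting loop (simplified placeholder).
        raise NotImplementedError


class EagleProposer(SpecDecodeBaseProposer):
    """EAGLE-style proposer: consumes the target model's hidden states,
    so it passes target_hidden_states to _propose."""


class DraftModelProposer(SpecDecodeBaseProposer):
    """New proposer backed by a standalone draft model: it calls
    _propose with target_hidden_states=None."""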
# # Replace the last token with the next token.
# # E.g., [b1, b2, c1, c2, c3, c3] -> [a2, b2, b3, c2, c3, c4]
# self.input_ids[last_token_indices] = next_token_ids
self.set_input_ids_first_pass(target_token_ids, next_token_ids, num_tokens, last_token_indices)
The _propose method in the base class SpecDecodeBaseProposer is not fully generic and will raise an exception when used by DraftModelProposer. The DraftModelProposer calls _propose with target_hidden_states=None, but _propose attempts to use this value unconditionally, which will lead to a crash.
Specifically:
- The block for SpecDcodeType.EAGLE3 (line 437) accesses target_hidden_states without checking whether it is None.
- The assignment self.hidden_states[:num_tokens] = target_hidden_states (line 496) will fail when target_hidden_states is None.
This is a critical issue that will prevent draft_model speculative decoding from working. Since the problematic lines are not part of this diff, I recommend either modifying _propose to be fully generic by adding the necessary guards (e.g., if self.pass_hidden_states_to_model:) or overriding _propose in DraftModelProposer with a simplified implementation that doesn't handle hidden states.
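To make the first option concrete, here is a minimal sketch of the guarded version. The pass_hidden_states_to_model flag is taken from the review's example; everything else is illustrative and not the PR's actual code.

# Illustrative sketch of the suggested guard, not the real implementation.
class SpecDecodeBaseProposer:
    pass_hidden_states_to_model = True  # would be False for DraftModelProposer

    def _propose(self, target_token_ids, next_token_ids, num_tokens,
                 last_token_indices, target_hidden_states=None):
        self.set_input_ids_first_pass(target_token_ids, next_token_ids,
                                      num_tokens, last_token_indices)
        # Guard every use of target_hidden_states so proposers that pass
        # None (the draft_model path) can reuse this shared method.
        if self.pass_hidden_states_to_model:
            assert target_hidden_states is not None
            self.hidden_states[:num_tokens] = target_hidden_states
        # ... remainder of the shared drafting loop ...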
new_token_ids = extend_flat_seqs(
    seqs=input_token_ids, end_locs=query_end_locs, new_vals=next_token_ids
)
logger.warning("new_token_ids: {}".format(new_token_ids))
This log message appears to be for debugging. Using logger.warning for diagnostic information can flood the logs and obscure actual warnings. Please use logger.debug instead.
Suggested change:
- logger.warning("new_token_ids: {}".format(new_token_ids))
+ logger.debug("new_token_ids: {}".format(new_token_ids))
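As a side note beyond the bot's suggestion, passing the value as a logging argument (instead of pre-formatting with str.format) defers the string construction until DEBUG logging is actually enabled:

logger.debug("new_token_ids: %s", new_token_ids)  # formatted only if DEBUG is on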
Signed-off-by: 01267596 <[email protected]>
Signed-off-by: 01267596 <[email protected]>
04d7929 to 458036d
Signed-off-by: 01267596 <[email protected]>
Signed-off-by: 01267596 <[email protected]>
What this PR does / why we need it?
This PR implements the draft_model speculative decoding feature; the corresponding RFC is here: #3585
This PR depends on adjustments to the code in vLLM; the specific changes are made here: https://github.com/HF-001/vllm/pull/1/files#diff-645d58630d5acf3a0b07226bfef1e890a584c32502ab97c3d4642070f39a783c, or in this PR: vllm-project/vllm#24322
Does this PR introduce any user-facing change?
How was this patch tested?
export CUDA_VISIBLE_DEVICES=7
export TP=1
export MODEL_PATH=/model/qwen3-0.6b
export MODEL_NAME=qwen3-0.6b
export PORT=10113
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port ${PORT} --dtype bfloat16 --model ${MODEL_PATH} --served-model-name ${MODEL_NAME} --tensor-parallel-size ${TP} --gpu-memory-utilization 0.9 --max-model-len 32768 --trust-remote-code --seed 42 --speculative_config '{"method":"draft_model","model":"/model/qwen3-0.6b","num_speculative_tokens":3,"draft_tensor_parallel_size":1, "disable_padded_drafter_batch":true}'
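For completeness, a minimal client-side check against the server launched above; it assumes the openai Python package is installed and reuses the port and served model name from the command. This check is an illustrative addition, not part of the original test description.

# Smoke test: send one completion request to the locally started server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10113/v1", api_key="EMPTY")
resp = client.completions.create(
    model="qwen3-0.6b",
    prompt="Speculative decoding lets a draft model propose tokens that",
    max_tokens=32,
)
print(resp.choices[0].text)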