[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding#33624
Conversation
…e is speculative decoding Signed-off-by: Richard Zou <zou3519@gmail.com>
Code Review
This pull request addresses a potential silent incorrectness issue with the fast_moe_cold_start optimization when speculative decoding is enabled. It introduces a new configuration flag, fast_moe_cold_start, which is enabled by default but is now correctly disabled when speculative decoding is active. The documentation for the new flag clearly explains the assumptions and risks associated with this optimization. The implementation is clean and effectively prevents the issue by checking for a speculative decoding configuration and logging a warning when the optimization is consequently ignored. This is a crucial fix for ensuring correctness in MoE models using speculative decoding.
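To illustrate the pattern the review describes, here is a minimal hypothetical sketch of the guard: a `fast_moe_cold_start` flag that defaults to on but is forced off, with a warning, whenever a speculative decoding config is present. The class and function names here are simplified stand-ins for illustration, not vLLM's actual code.

```python
import logging

logger = logging.getLogger(__name__)


class CompilationConfig:
    """Simplified stand-in for a compilation config carrying the new flag."""

    def __init__(self, fast_moe_cold_start: bool = True):
        # Enabled by default, per the PR description.
        self.fast_moe_cold_start = fast_moe_cold_start


def maybe_disable_fast_moe_cold_start(compilation_config, speculative_config):
    """Disable the cold-start optimization when speculative decoding is active.

    The optimization assumes token routing during warmup matches routing at
    serve time; speculative decoding can violate that assumption, so the flag
    is ignored (and a warning logged) rather than risking silent incorrectness.
    """
    if speculative_config is not None and compilation_config.fast_moe_cold_start:
        logger.warning(
            "fast_moe_cold_start is ignored because speculative decoding "
            "is enabled; keeping it on could produce silently incorrect "
            "results."
        )
        compilation_config.fast_moe_cold_start = False
```

The key design point the review highlights is that the check happens once at configuration time, so downstream code only ever sees a flag value that is safe to act on.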
running now
thanks for the fix!

```just
MODEL := "nvidia/DeepSeek-R1-NVFP4"
GPUS := "4"
PORT := "8001"

launch_mtp:
    chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} -tp {{GPUS}} --speculative_config '{"num_speculative_tokens":1, "method":"deepseek_mtp"}' --port {{PORT}} --enforce-eager

benchmark:
    vllm bench serve \
        --port {{PORT}} \
        --model {{MODEL}} \
        --dataset-name random \
        --input-len 1000 \
        --output-len 200 \
        --max-concurrency 10 \
        --num-prompts 50 \
        --seed $(date +%s) \
        --temperature 0.0
```
…e is speculative decoding (vllm-project#33624) Signed-off-by: Richard Zou <zou3519@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: Pai <416932041@qq.com>
…e is speculative decoding (vllm-project#33624) Signed-off-by: Richard Zou <zou3519@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
…e is speculative decoding (vllm-project#33624) Signed-off-by: Richard Zou <zou3519@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
I'm also down to turn this optimization off by default, just let me know.
I don't have a machine to run DeepSeek V3.2 right now, so could someone please test this?