[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding#33624
Conversation
…e is speculative decoding Signed-off-by: Richard Zou <zou3519@gmail.com>
Code Review
This pull request addresses a potential silent incorrectness issue with the fast_moe_cold_start optimization when speculative decoding is enabled. It introduces a new configuration flag, fast_moe_cold_start, which is enabled by default but is now correctly disabled when speculative decoding is active. The documentation for the new flag clearly explains the assumptions and risks associated with this optimization. The implementation is clean and effectively prevents the issue by checking for a speculative decoding configuration and logging a warning when the optimization is consequently ignored. This is a crucial fix for ensuring correctness in MoE models using speculative decoding.
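To illustrate the pattern the review describes, here is a minimal hypothetical sketch of the guard: a `fast_moe_cold_start` flag that defaults to on but is forced off, with a warning, whenever a speculative decoding config is present. The class and function names here are simplified stand-ins for illustration, not vLLM's actual code.

```python
import logging

logger = logging.getLogger(__name__)


class CompilationConfig:
    """Simplified stand-in for a compilation config carrying the new flag."""

    def __init__(self, fast_moe_cold_start: bool = True):
        # Enabled by default, per the PR description.
        self.fast_moe_cold_start = fast_moe_cold_start


def maybe_disable_fast_moe_cold_start(compilation_config, speculative_config):
    """Disable the cold-start optimization when speculative decoding is active.

    The optimization assumes token routing during warmup matches routing at
    serve time; speculative decoding can violate that assumption, so the flag
    is ignored (and a warning logged) rather than risking silent incorrectness.
    """
    if speculative_config is not None and compilation_config.fast_moe_cold_start:
        logger.warning(
            "fast_moe_cold_start is ignored because speculative decoding "
            "is enabled; keeping it on could produce silently incorrect "
            "results."
        )
        compilation_config.fast_moe_cold_start = False
```

The key design point the review highlights is that the check happens once at configuration time, so downstream code only ever sees a flag value that is safe to act on.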
running now
thanks for the fix!

```just
MODEL := "nvidia/DeepSeek-R1-NVFP4"
GPUS := "4"
PORT := "8001"

launch_mtp:
    chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} -tp {{GPUS}} --speculative_config '{"num_speculative_tokens":1, "method":"deepseek_mtp"}' --port {{PORT}} --enforce-eager

benchmark:
    vllm bench serve \
        --port {{PORT}} \
        --model {{MODEL}} \
        --dataset-name random \
        --input-len 1000 \
        --output-len 200 \
        --max-concurrency 10 \
        --num-prompts 50 \
        --seed $(date +%s) \
        --temperature 0.0
```
…e is speculative decoding (vllm-project#33624) Signed-off-by: Richard Zou <zou3519@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: Pai <416932041@qq.com>
…e is speculative decoding (vllm-project#33624) Signed-off-by: Richard Zou <zou3519@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
…e is speculative decoding (vllm-project#33624) Signed-off-by: Richard Zou <zou3519@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
I'm also down to turn this optimization off by default, just let me know.
I don't have a machine to run DeepSeek V3.2 right now, so could someone please test this?