DeepSeek-V4-Pro: enable num_speculative_tokens=2 on Hopper #435
Conversation
The `hopper` hardware_override pinned MTP to 1 speculative token because the H200 kernels were limited at the time of the original recipe. Recent vLLM H200 MTP runs accept 2 draft tokens; remove the override so Hopper uses the same `num_speculative_tokens=2` as Blackwell.

Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Code Review
This pull request updates the speculative decoding configuration for the DeepSeek-V4-Pro model by removing the Hopper-specific hardware override, which increases the number of speculative tokens to 2. The reviewer suggests applying this same change to the DeepSeek-V4-Flash model configuration to maintain consistency across the model family.
```diff
 - "deepseek_v4"
   spec_decoding:
-    description: "Multi-Token Prediction speculative decoding with 2 speculative tokens (1 on Hopper)."
+    description: "Multi-Token Prediction speculative decoding with 2 speculative tokens."
```
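For context, the change amounts to dropping a hardware-specific block from the recipe. A hypothetical sketch of the resulting `spec_decoding` section is below; field names beyond those quoted in the diff are assumptions, not the actual repo schema:

```yaml
spec_decoding:
  description: "Multi-Token Prediction speculative decoding with 2 speculative tokens."
  method: "mtp"                # assumed field name
  num_speculative_tokens: 2    # now applies on Hopper and Blackwell alike
  # Removed by this PR (hypothetical shape of the old override):
  # hardware_override:
  #   hopper:
  #     num_speculative_tokens: 1
```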
The update to num_speculative_tokens=2 for Hopper is consistent with recent kernel improvements mentioned in the PR description. However, DeepSeek-V4-Flash.yaml (lines 58-66) still contains the same num_speculative_tokens=1 override for Hopper. Since this improvement is hardware-dependent and applies to the MTP kernels used by both models, you should consider updating the Flash recipe as well to maintain consistency across the V4 model family and ensure optimal performance for all users on Hopper hardware.
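The override-removal mechanics can be illustrated with a small sketch. The merge function below is hypothetical and stands in for whatever resolution logic the recipe loader actually uses; it shows why deleting the `hopper` entry makes Hopper inherit the recipe default of 2:

```python
def resolve_spec_decoding(defaults: dict, overrides: dict, hardware: str) -> dict:
    """Hypothetical sketch: apply a per-hardware override dict over the
    recipe defaults. The real loader's logic is not shown in this PR."""
    merged = dict(defaults)
    merged.update(overrides.get(hardware, {}))
    return merged

defaults = {"method": "mtp", "num_speculative_tokens": 2}

# Before this PR: a hopper entry pinned the draft-token count to 1.
old = resolve_spec_decoding(
    defaults, {"hopper": {"num_speculative_tokens": 1}}, "hopper"
)
# After this PR: no hopper entry, so Hopper inherits the default of 2.
new = resolve_spec_decoding(defaults, {}, "hopper")

print(old["num_speculative_tokens"], new["num_speculative_tokens"])  # 1 2
```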

Summary
Drop the Hopper-specific `num_speculative_tokens=1` override for DeepSeek-V4-Pro MTP so Hopper (H200) uses the same `num_speculative_tokens=2` as Blackwell.

Context
The recipe was originally added with a `hopper` hardware override under `spec_decoding` because H200 MTP kernels were limited to 1 draft token at the time. Recent vLLM H200 MTP testing shows the kernels accept 2 draft tokens, matching what Blackwell already does (cf. the Blackwell MTP submission by @wzhao18). Removing the override aligns Hopper with the recipe default.

If H200 MTP throughput or acceptance rate at `num_speculative_tokens=2` ends up worse than at 1 in your testing, the override should be reinstated; our internal sweep on H200 with this change is in flight, and the current evidence is that 2 wins.

Change
Test plan
- Launch the recipe with `spec_decoding` opted in and confirm the engine starts with `--speculative_config '{"method":"mtp","num_speculative_tokens":2}'`.
- Confirm no Hopper hardware override falls back to `num_speculative_tokens=1`.
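The first check in the test plan can be scripted. The helper below simply reproduces the flag value quoted above; its name is made up for illustration and is not part of vLLM or the recipe repo:

```python
import json

def build_speculative_config(num_speculative_tokens: int) -> str:
    """Build the JSON value passed to vLLM's --speculative_config flag
    (flag as shown in the test plan; this helper is illustrative only)."""
    return json.dumps(
        {"method": "mtp", "num_speculative_tokens": num_speculative_tokens},
        separators=(",", ":"),
    )

# Expected engine launch argument after this PR:
print(build_speculative_config(2))
# {"method":"mtp","num_speculative_tokens":2}
```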