DeepSeek-V4-Pro: enable num_speculative_tokens=2 on Hopper#435

Merged
zixi-qi merged 1 commit into vllm-project:main from functionstackx:dsv4-pro-h200-mtp-num-tokens-2
May 4, 2026

Conversation

@functionstackx
Contributor

Summary

Drop the Hopper-specific num_speculative_tokens=1 override for DeepSeek-V4-Pro MTP so Hopper (H200) uses the same num_speculative_tokens=2 as Blackwell.

Context

The recipe was originally added with a hopper hardware override under spec_decoding because H200 MTP kernels were limited to 1 draft token at the time. Recent vLLM H200 MTP testing shows the kernels accept 2 draft tokens, matching what Blackwell already does (cf. Blackwell MTP submission by @wzhao18). Removing the override aligns Hopper with the recipe default.

If H200 MTP throughput or acceptance rate at num_speculative_tokens=2 turns out worse than at 1 in your testing, the override should be reinstated. Our internal sweep on H200 with this change is still in flight, but the current evidence is that 2 wins.

Change

   spec_decoding:
-    description: "Multi-Token Prediction speculative decoding with 2 speculative tokens (1 on Hopper)."
+    description: "Multi-Token Prediction speculative decoding with 2 speculative tokens."
     args:
       - "--speculative_config"
       - '{"method":"mtp","num_speculative_tokens":2}'
-    hardware_overrides:
-      hopper:
-        args:
-          - "--speculative_config"
-          - '{"method":"mtp","num_speculative_tokens":1}'
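With the override removed, the spec_decoding section of the recipe reduces to the following (a sketch of the resulting fragment; surrounding keys elided):

```yaml
# Resulting recipe fragment after this PR (surrounding keys elided).
# Hopper now inherits the default and matches Blackwell.
spec_decoding:
  description: "Multi-Token Prediction speculative decoding with 2 speculative tokens."
  args:
    - "--speculative_config"
    - '{"method":"mtp","num_speculative_tokens":2}'
```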

Test plan

  • On an H200 node, launch DeepSeek-V4-Pro with spec_decoding opted in and confirm the engine starts with --speculative_config '{"method":"mtp","num_speculative_tokens":2}'.
  • Confirm that the acceptance rate is in a sane range and that end-to-end throughput at matched concurrency points is at least on par with num_speculative_tokens=1.
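As a quick sanity check independent of hardware, the JSON payload handed to --speculative_config can be parsed and verified before launch. This is a minimal sketch that only validates the flag's value, not engine behavior (which depends on your vLLM version):

```python
import json

# The exact value passed to vLLM's --speculative_config flag after this change.
spec_config = '{"method":"mtp","num_speculative_tokens":2}'

# Parse and confirm the fields the engine will receive.
cfg = json.loads(spec_config)
assert cfg["method"] == "mtp"
assert cfg["num_speculative_tokens"] == 2
print(cfg)
```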

The hopper hardware_override pinned MTP to 1 speculative token because the
H200 kernels were limited at the time of the original recipe. Recent vLLM
H200 MTP runs accept 2 draft tokens; remove the override so Hopper uses
the same num_speculative_tokens=2 as Blackwell.

Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>

@gemini-code-assist Bot left a comment
Code Review

This pull request updates the speculative decoding configuration for the DeepSeek-V4-Pro model by removing the Hopper-specific hardware override, which increases the number of speculative tokens to 2. The reviewer suggests applying this same change to the DeepSeek-V4-Flash model configuration to maintain consistency across the model family.

      - "deepseek_v4"
    spec_decoding:
-     description: "Multi-Token Prediction speculative decoding with 2 speculative tokens (1 on Hopper)."
+     description: "Multi-Token Prediction speculative decoding with 2 speculative tokens."
Severity: medium

The update to num_speculative_tokens=2 for Hopper is consistent with recent kernel improvements mentioned in the PR description. However, DeepSeek-V4-Flash.yaml (lines 58-66) still contains the same num_speculative_tokens=1 override for Hopper. Since this improvement is hardware-dependent and applies to the MTP kernels used by both models, you should consider updating the Flash recipe as well to maintain consistency across the V4 model family and ensure optimal performance for all users on Hopper hardware.
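To audit which recipes still pin MTP to 1 draft token under a hopper override, a small scan like the following could help. This is a hypothetical helper, not part of the repo; the directory layout and the exact override text are assumed from this PR's diff:

```python
from pathlib import Path


def find_hopper_mtp1(root: str) -> list[str]:
    """Return recipe YAML files that still contain a hopper override
    pinning MTP to num_speculative_tokens=1 (string match, not a YAML parse)."""
    hits = []
    for path in Path(root).rglob("*.yaml"):
        text = path.read_text()
        if "hopper" in text and '"num_speculative_tokens":1' in text:
            hits.append(str(path))
    return sorted(hits)
```

Running it over the recipes directory would flag DeepSeek-V4-Flash.yaml (and anything else still carrying the old override) for the same cleanup.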

@zixi-qi zixi-qi merged commit 1729694 into vllm-project:main May 4, 2026
4 checks passed
@functionstackx
Contributor Author

verified perf improvement on mtp2

[screenshot: MTP benchmark results]
