DeepSeek-V4-Pro: enable num_speculative_tokens=2 on Hopper #435
Conversation
The `hopper` hardware_override pinned MTP to 1 speculative token because the H200 kernels were limited at the time of the original recipe. Recent vLLM H200 MTP runs accept 2 draft tokens; remove the override so Hopper uses the same `num_speculative_tokens=2` as Blackwell.

Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Code Review
This pull request updates the speculative decoding configuration for the DeepSeek-V4-Pro model by removing the Hopper-specific hardware override, which increases the number of speculative tokens to 2. The reviewer suggests applying this same change to the DeepSeek-V4-Flash model configuration to maintain consistency across the model family.
```diff
 - "deepseek_v4"
   spec_decoding:
-    description: "Multi-Token Prediction speculative decoding with 2 speculative tokens (1 on Hopper)."
+    description: "Multi-Token Prediction speculative decoding with 2 speculative tokens."
```
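For context, the change amounts to dropping a hardware-specific block from the recipe. A hypothetical sketch of the resulting `spec_decoding` section is below; field names beyond those quoted in the diff are assumptions, not the actual repo schema:

```yaml
spec_decoding:
  description: "Multi-Token Prediction speculative decoding with 2 speculative tokens."
  method: "mtp"                # assumed field name
  num_speculative_tokens: 2    # now applies on Hopper and Blackwell alike
  # Removed by this PR (hypothetical shape of the old override):
  # hardware_override:
  #   hopper:
  #     num_speculative_tokens: 1
```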
The update to num_speculative_tokens=2 for Hopper is consistent with recent kernel improvements mentioned in the PR description. However, DeepSeek-V4-Flash.yaml (lines 58-66) still contains the same num_speculative_tokens=1 override for Hopper. Since this improvement is hardware-dependent and applies to the MTP kernels used by both models, you should consider updating the Flash recipe as well to maintain consistency across the V4 model family and ensure optimal performance for all users on Hopper hardware.
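The override-removal mechanics can be illustrated with a small sketch. The merge function below is hypothetical and stands in for whatever resolution logic the recipe loader actually uses; it shows why deleting the `hopper` entry makes Hopper inherit the recipe default of 2:

```python
def resolve_spec_decoding(defaults: dict, overrides: dict, hardware: str) -> dict:
    """Hypothetical sketch: apply a per-hardware override dict over the
    recipe defaults. The real loader's logic is not shown in this PR."""
    merged = dict(defaults)
    merged.update(overrides.get(hardware, {}))
    return merged

defaults = {"method": "mtp", "num_speculative_tokens": 2}

# Before this PR: a hopper entry pinned the draft-token count to 1.
old = resolve_spec_decoding(
    defaults, {"hopper": {"num_speculative_tokens": 1}}, "hopper"
)
# After this PR: no hopper entry, so Hopper inherits the default of 2.
new = resolve_spec_decoding(defaults, {}, "hopper")

print(old["num_speculative_tokens"], new["num_speculative_tokens"])  # 1 2
```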

Summary
Drop the Hopper-specific `num_speculative_tokens=1` override for DeepSeek-V4-Pro MTP so Hopper (H200) uses the same `num_speculative_tokens=2` as Blackwell.

Context
The recipe was originally added with a `hopper` hardware override under `spec_decoding` because H200 MTP kernels were limited to 1 draft token at the time. Recent vLLM H200 MTP testing shows the kernels accept 2 draft tokens, matching what Blackwell already does (cf. the Blackwell MTP submission by @wzhao18). Removing the override aligns Hopper with the recipe default.

If H200 MTP throughput or acceptance rate at `num_speculative_tokens=2` ends up worse than at 1 in your testing, the override should be reinstated; our internal sweep on H200 with this change is in flight, and the current evidence is that 2 wins.

Change
Test plan
- Launch the recipe with `spec_decoding` opted in and confirm the engine starts with `--speculative_config '{"method":"mtp","num_speculative_tokens":2}'`.
- Confirm no Hopper hardware override falls back to `num_speculative_tokens=1`.
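The first check in the test plan can be scripted. The helper below simply reproduces the flag value quoted above; its name is made up for illustration and is not part of vLLM or the recipe repo:

```python
import json

def build_speculative_config(num_speculative_tokens: int) -> str:
    """Build the JSON value passed to vLLM's --speculative_config flag
    (flag as shown in the test plan; this helper is illustrative only)."""
    return json.dumps(
        {"method": "mtp", "num_speculative_tokens": num_speculative_tokens},
        separators=(",", ":"),
    )

# Expected engine launch argument after this PR:
print(build_speculative_config(2))
# {"method":"mtp","num_speculative_tokens":2}
```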