
docs(DeepSeek-V4): add h200|big verified recipes + tune H200 Pro parameters#23742

Merged
wisclmy0611 merged 1 commit into sgl-project:main from yushengsu-thu:cookbook
Apr 26, 2026

Conversation

@yushengsu-thu
Collaborator

Summary

  • Mark h200|big|low-latency, h200|big|balanced, h200|big|max-throughput as verified in the interactive command generator.
  • Tune H200 Pro (big) parameters based on testing (summarized in the sketch below):
    • SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: 256 → 128 (balanced & max-throughput)
    • --cuda-graph-max-bs: 32 → 8, --max-running-requests: 64 → 32 (low-latency)
    • --mem-fraction-static: 0.82 → 0.88 (low-latency / balanced / max-throughput)
    • Balanced recipe: add dedicated --cuda-graph-max-bs 8 and --max-running-requests 32 for H200 Pro
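
A minimal sketch of what this tuning amounts to, reusing the hardware/isBig/flags naming visible in the review snippets below; the recipe parameter and the function wrapper are illustrative only, not the actual code in the docs snippet:

// Sketch only: collects the H200 Pro (big) values this PR settles on.
// The real snippet keys the same values off `hardware === "h200" && isBig`;
// `recipe` and the function wrapper here are hypothetical.
function h200ProBigTuning(recipe) {
  const env = [];
  const flags = [];
  if (recipe === "balanced" || recipe === "max-throughput") {
    // DeepEP dispatch-token cap lowered from 256 to 128.
    env.push("SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128");
  }
  if (recipe === "low-latency" || recipe === "balanced") {
    // CUDA graph batch 32 -> 8 and running-request cap 64 -> 32.
    flags.push(" --cuda-graph-max-bs 8");
    flags.push(" --max-running-requests 32");
  }
  // Static memory fraction raised from 0.82 to 0.88 for all three recipes.
  flags.push(" --mem-fraction-static 0.88");
  return { env, flags };
}

For example, h200ProBigTuning("low-latency") yields no env override and the flags --cuda-graph-max-bs 8, --max-running-requests 32, and --mem-fraction-static 0.88.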

Test plan

  • Run mint dev locally and verify the interactive command generator produces correct commands for all H200 Pro combinations
  • Verify no regressions for existing verified combos (H200 small, Blackwell, etc.)

Made with Cursor

Copilot AI review requested due to automatic review settings April 26, 2026 03:26
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Contributor

Copilot AI left a comment

Pull request overview

Updates the DeepSeek-V4 deployment interactive command generator snippet to treat H200 Pro (big) recipes as verified and to tune several H200 Pro-specific launch parameters based on recent testing.

Changes:

  • Mark h200|big|low-latency, h200|big|balanced, and h200|big|max-throughput as verified (runnable) recipes.
  • Adjust H200 Pro (big) DeepEP dispatch-token cap for balanced/max-throughput and reduce low-latency cuda-graph/max-running-requests.
  • Increase H200 Pro (big) --mem-fraction-static to 0.88 in multiple recipes and add explicit balanced H200 Pro cuda-graph/max-running-requests flags.

Comment on lines +350 to +353
if (hardware === "h200" && isBig) {
  flags.push(" --cuda-graph-max-bs 8");
  flags.push(" --max-running-requests 32");
} else if (hardware === "h200") {

Copilot AI Apr 26, 2026


The balanced recipe comment above says H200 big: cg=128 max-run=128 (same) but for hardware === "h200" && isBig the code now emits --cuda-graph-max-bs 8 and --max-running-requests 32. Update the comment to reflect the new H200 Pro tuned values to avoid confusing users (and to preserve the "mirror of allinone" intent documented earlier in the file).

flags.push(" --disable-flashinfer-autotune");
}
if (isBig) flags.push(" --mem-fraction-static 0.82");
if (isBig) flags.push(" --mem-fraction-static 0.88");

Copilot AI Apr 26, 2026


--mem-fraction-static was changed to 0.88 for all isBig low-latency runs. The block comment above says Blackwell big should use mem-frac 0.82, and the PR description suggests this tuning is specific to H200 Pro. Consider gating 0.88 to hardware === "h200" && isBig and keeping the prior value for other big variants to avoid altering existing verified B200/GB300 commands.
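
A minimal sketch of the gating suggested here, reusing the hardware, isBig, and flags names from the surrounding snippet; the actual fix may be structured differently:

// Keep the prior 0.82 for non-H200 big variants; apply the new 0.88 only to H200 Pro.
if (isBig) {
  flags.push(hardware === "h200"
    ? " --mem-fraction-static 0.88"   // H200 Pro value tuned in this PR
    : " --mem-fraction-static 0.82"); // prior value for other big variants (e.g. Blackwell)
}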

Comment on lines 317 to 320
if (hardware === "h200" && isBig) {
-  flags.push(" --cuda-graph-max-bs 32");
-  flags.push(" --max-running-requests 64");
+  flags.push(" --cuda-graph-max-bs 8");
+  flags.push(" --max-running-requests 32");
}

Copilot AI Apr 26, 2026


The surrounding allinone summary comment for H200 big low-latency still mentions cg=32 max-run=64 (and mem-frac 0.82), but the actual flags now emit --cuda-graph-max-bs 8 and --max-running-requests 32 (and mem-frac 0.88). Please update the comment to match the new tuned values so the snippet remains self-consistent.
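
Illustratively, using the cg=/max-run= shorthand quoted in this review, the updated summary line for H200 big low-latency would read roughly as follows; the exact wording in the docs comment may differ:

// H200 big low-latency: cg=8 max-run=32 mem-frac=0.88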

wisclmy0611 merged commit 3cfd156 into sgl-project:main on Apr 26, 2026
42 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026