docs(DeepSeek-V4): add h200|big verified recipes + tune H200 Pro parameters#23742
wisclmy0611 merged 1 commit into sgl-project:main
Pull request overview
Updates the DeepSeek-V4 deployment interactive command generator snippet to treat H200 Pro (big) recipes as verified and to tune several H200 Pro-specific launch parameters based on recent testing.
Changes:
- Mark `h200|big|low-latency`, `h200|big|balanced`, and `h200|big|max-throughput` as verified (runnable) recipes.
- Adjust the H200 Pro (big) DeepEP dispatch-token cap for balanced/max-throughput and reduce the low-latency `cuda-graph`/`max-running-requests` values.
- Increase H200 Pro (big) `--mem-fraction-static` to `0.88` in multiple recipes and add explicit balanced H200 Pro `cuda-graph`/`max-running-requests` flags.
    if (hardware === "h200" && isBig) {
      flags.push(" --cuda-graph-max-bs 8");
      flags.push(" --max-running-requests 32");
    } else if (hardware === "h200") {
The balanced recipe comment above says `H200 big: cg=128 max-run=128 (same)`, but for `hardware === "h200" && isBig` the code now emits `--cuda-graph-max-bs 8` and `--max-running-requests 32`. Update the comment to reflect the new H200 Pro tuned values to avoid confusing users (and to preserve the "mirror of allinone" intent documented earlier in the file).
      flags.push(" --disable-flashinfer-autotune");
      }
    - if (isBig) flags.push(" --mem-fraction-static 0.82");
    + if (isBig) flags.push(" --mem-fraction-static 0.88");
`--mem-fraction-static` was changed to `0.88` for all `isBig` low-latency runs. The block comment above says Blackwell big should use mem-frac `0.82`, and the PR description suggests this tuning is specific to H200 Pro. Consider gating `0.88` to `hardware === "h200" && isBig` and keeping the prior value for other big variants to avoid altering existing verified B200/GB300 commands.
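The gating suggested in this comment could look roughly like the sketch below. `memFractionFlag` is a hypothetical helper name that does not appear in the snippet; `hardware`, `isBig`, and the two values are taken from the diff and the comment, and the `"b200"` hardware key is an assumed spelling used only for illustration.

```javascript
// Hypothetical helper (not part of the PR): picks the low-latency
// --mem-fraction-static flag for "big" variants, applying the 0.88
// tuning only to H200 Pro.
function memFractionFlag(hardware, isBig) {
  // The diff only emits this flag for big variants.
  if (!isBig) return null;
  // 0.88 is the H200 Pro value from this PR; 0.82 is the prior value,
  // kept for the other big variants as the review comment suggests.
  const value = hardware === "h200" ? "0.88" : "0.82";
  return ` --mem-fraction-static ${value}`;
}

// Usage: append the flag only when one applies.
const flags = [];
const memFlag = memFractionFlag("h200", true);
if (memFlag !== null) flags.push(memFlag);
```

With this shape, the existing verified commands for other big variants keep emitting `0.82`, and only the H200 Pro path picks up the new value.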
      if (hardware === "h200" && isBig) {
    -   flags.push(" --cuda-graph-max-bs 32");
    -   flags.push(" --max-running-requests 64");
    +   flags.push(" --cuda-graph-max-bs 8");
    +   flags.push(" --max-running-requests 32");
      }
The surrounding allinone summary comment for H200 big low-latency still mentions `cg=32 max-run=64` (and mem-frac `0.82`), but the actual flags now emit `--cuda-graph-max-bs 8` and `--max-running-requests 32` (and mem-frac `0.88`). Please update the comment to match the new tuned values so the snippet remains self-consistent.
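One way to keep the summary comment and the emitted flags from drifting apart is to derive both from a single table. The following is only a sketch under that assumption: the object and function names are hypothetical and not taken from the snippet, while the numbers are the tuned H200 Pro low-latency values quoted in this review.

```javascript
// Assumed single source of truth for the H200 Pro (big) low-latency values.
// The numbers come from this PR; the names are hypothetical.
const H200_BIG_LOW_LATENCY = {
  cudaGraphMaxBs: 8,
  maxRunningRequests: 32,
  memFractionStatic: "0.88",
};

// Build the CLI flags in the same string form the diff pushes into `flags`.
function toFlags(tuning) {
  return [
    ` --cuda-graph-max-bs ${tuning.cudaGraphMaxBs}`,
    ` --max-running-requests ${tuning.maxRunningRequests}`,
    ` --mem-fraction-static ${tuning.memFractionStatic}`,
  ];
}

// Build the human-readable summary in the style of the existing comment.
function toSummary(tuning) {
  return `H200 big: cg=${tuning.cudaGraphMaxBs} max-run=${tuning.maxRunningRequests} mem-frac ${tuning.memFractionStatic}`;
}
```

Because both strings read from the same object, a future retune changes the flags and the summary together.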
Summary
- Mark `h200|big|low-latency`, `h200|big|balanced`, and `h200|big|max-throughput` as verified in the interactive command generator.
- `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK`: 256 → 128 (balanced & max-throughput)
- `--cuda-graph-max-bs`: 32 → 8, `--max-running-requests`: 64 → 32 (low-latency)
- `--mem-fraction-static`: 0.82 → 0.88 (low-latency / balanced / max-throughput)
- Add explicit `--cuda-graph-max-bs 8` and `--max-running-requests 32` for H200 Pro
Test plan
- Run `mint dev` locally and verify the interactive command generator produces correct commands for all H200 Pro combinations
Made with Cursor