docs(DeepSeek-V4): verify H200 Pro max-throughput recipe #23726
yhyang201 wants to merge 2 commits into sgl-project:main
Update H200 big (Pro 1.6T) max-throughput parameters to match verified 2-node deployment:

- DISPATCH_TOKENS: 256 → 128
- --max-running-requests: 256 → 64
- --mem-fraction-static: 0.82 → 0.875
- Remove --cuda-graph-max-bs 128 (not needed)

Mark h200|big|max-throughput as verified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request adds the 'h200|big|max-throughput' configuration to the DeepSeek-V4 deployment documentation, updating environment variables and CLI flags for H200 hardware. Specifically, it adjusts memory fraction and request limits for the 'big' variant. A review comment suggests consolidating the conditional logic for H200-specific overrides to enhance code readability and maintainability.
    if (isBig && hardware === "h200") {
      flags.push(" --mem-fraction-static 0.875");
    } else if (isBig) {
      flags.push(" --mem-fraction-static 0.82");
    }
    if (hardware === "h200" && isBig) {
      flags.push(" --max-running-requests 64");
    } else if (hardware === "h200") {
      flags.push(" --cuda-graph-max-bs 128");
      flags.push(" --max-running-requests 256");
The logic for adding flags in the max-throughput recipe is slightly fragmented across multiple if blocks. While functional, consolidating the hardware === "h200" && isBig check would improve readability and maintainability, especially as more hardware-specific overrides are added.
    if (hardware === "h200" && isBig) {
      flags.push(" --mem-fraction-static 0.875");
      flags.push(" --max-running-requests 64");
    } else {
      if (isBig) flags.push(" --mem-fraction-static 0.82");
      if (hardware === "h200") {
        flags.push(" --cuda-graph-max-bs 128");
        flags.push(" --max-running-requests 256");
      } else if (isBig && hardware === "b200") {
        flags.push(" --cuda-graph-max-bs 64");
        flags.push(" --max-running-requests 256");
      } else if (isBig && hardware === "gb300") {
        flags.push(" --cuda-graph-max-bs 128");
        flags.push(" --max-running-requests 256");
      }
    }

Add commented-out hints for machine-specific env vars (NVSHMEM, GLOO, NCCL) on H200 big (2-node) deployments, matching the GB200 pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
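The consolidated branching from the review suggestion can be exercised in isolation to confirm which flags each hardware/size combination receives. The sketch below wraps that logic in a standalone function; the name `maxThroughputFlags`, its parameters, and its return shape are hypothetical and not taken from the repository. Only the flag strings and the branch conditions come from the suggestion itself.

```javascript
// Hypothetical wrapper around the suggested branching so it can be called
// and checked on its own. The function name and signature are assumptions;
// the conditions and flag strings mirror the review suggestion.
function maxThroughputFlags(hardware, isBig) {
  const flags = [];
  if (hardware === "h200" && isBig) {
    // Verified H200 big (2-node) values; no --cuda-graph-max-bs here.
    flags.push(" --mem-fraction-static 0.875");
    flags.push(" --max-running-requests 64");
  } else {
    if (isBig) flags.push(" --mem-fraction-static 0.82");
    if (hardware === "h200") {
      flags.push(" --cuda-graph-max-bs 128");
      flags.push(" --max-running-requests 256");
    } else if (isBig && hardware === "b200") {
      flags.push(" --cuda-graph-max-bs 64");
      flags.push(" --max-running-requests 256");
    } else if (isBig && hardware === "gb300") {
      flags.push(" --cuda-graph-max-bs 128");
      flags.push(" --max-running-requests 256");
    }
  }
  return flags;
}

// Usage: join the fragments the same way a command builder might.
console.log(maxThroughputFlags("h200", true).join(""));
// prints " --mem-fraction-static 0.875 --max-running-requests 64"
```

Keeping the H200 big case in a single branch means its three differences from the default (memory fraction, request limit, no CUDA graph batch-size cap) are visible in one place.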
Move to #23742
Summary
- DISPATCH_TOKENS: 256 → 128
- --max-running-requests: 256 → 64
- --mem-fraction-static: 0.82 → 0.875
- Remove --cuda-graph-max-bs 128 (not needed for H200 big)
- Mark h200|big|max-throughput as verified

Test plan
🤖 Generated with Claude Code