[NVIDIA] chore: B200 single node DeepSeek v4 SGLang MTP #1145
Changes from 14 commits
069547e
4c4cb70
33e2d28
ef48416
3cec2be
b7a7e29
1dc5646
b29d8ec
cc0b95d
59182b9
d538a4a
c8b48b5
6ee2f21
1dd4db6
0ab8925
ed1aeda
9b7cb76
33955e1
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| hf download "$MODEL" | ||
|
|
||
| nvidia-smi | ||
|
|
||
| export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 | ||
|
|
||
| # The deepseek-v4-blackwell image bakes CUDA_VISIBLE_DEVICES=4,5,6,7 into its ENV, | ||
| # which masks half of the 8 GPUs Slurm allocates us. Clear it so TP=8 can bind to | ||
| # all ranks. | ||
| unset CUDA_VISIBLE_DEVICES | ||
|
|
||
| # TODO(Cam): the lmsysorg/sglang:deepseek-v4-blackwell image installs sglang | ||
| # editable at /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang. | ||
| # The runner mounts our repo at a non-/workspace path for this image so the editable | ||
| # install stays visible. Paths in this script are $PWD-relative for that reason. | ||
| # Drop the runner conditional once lmsys moves sglang back out of /workspace. | ||
|
|
||
| SERVER_LOG="$PWD/server.log" | ||
| PORT=${PORT:-8888} | ||
|
|
||
| echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL" | ||
|
|
||
| EVAL_CONTEXT_ARGS="" | ||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN" | ||
| fi | ||
|
|
||
| start_gpu_monitor --output "$PWD/gpu_metrics.csv" | ||
|
|
||
| set -x | ||
| PYTHONNOUSERSITE=1 \ | ||
| SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 \ | ||
| SGLANG_OPT_USE_TOPK_V2=1 \ | ||
| SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 \ | ||
| sglang serve \ | ||
| --trust-remote-code \ | ||
| --model-path $MODEL \ | ||
| --tp 8 \ | ||
| --moe-runner-backend flashinfer_mxfp4 \ | ||
| --speculative-algo EAGLE \ | ||
| --speculative-num-steps 3 \ | ||
| --speculative-eagle-topk 1 \ | ||
| --speculative-num-draft-tokens 4 \ | ||
|
Check failure on line 61 in benchmarks/single_node/dsv4_fp4_b200.sh
|
||
| --chunked-prefill-size 4096 \ | ||
| --disable-flashinfer-autotune \ | ||
| --mem-fraction-static 0.82 \ | ||
| --host 0.0.0.0 \ | ||
| --port $PORT > $SERVER_LOG 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| pip install -q datasets pandas | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts $((CONC * 10)) \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir "$PWD/" | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x | ||
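The script comment about `CUDA_VISIBLE_DEVICES=4,5,6,7` masking half of the node can be illustrated with a small pre-flight check. This is only a sketch: `count_visible_gpus` and `mask_allows_tp` are hypothetical helpers, not functions from `benchmark_lib.sh`, and an empty string is used here to model the *unset* case (a variable that is set but empty hides every GPU, which this sketch does not model).

```shell
# Hypothetical helper: count the GPUs exposed by a CUDA_VISIBLE_DEVICES value.
# An empty first argument models an unset variable, i.e. all GPUs visible.
count_visible_gpus() {
    local mask="$1" total="$2"
    if [ -z "$mask" ]; then
        echo "$total"
    else
        # Each comma-separated entry in the mask is one visible device.
        printf '%s\n' "$mask" | awk -F',' '{ print NF }'
    fi
}

# Hypothetical pre-flight check: succeeds when the mask leaves at least TP GPUs.
mask_allows_tp() {
    local mask="$1" total="$2" tp="$3"
    [ "$(count_visible_gpus "$mask" "$total")" -ge "$tp" ]
}

# The mask baked into the image exposes 4 of 8 GPUs, so TP=8 cannot bind all ranks.
if ! mask_allows_tp "4,5,6,7" 8 8; then
    echo "mask 4,5,6,7 exposes too few GPUs for TP=8"
fi
```

Running `unset CUDA_VISIBLE_DEVICES` before launch, as the script does, corresponds to the empty-mask case in this sketch.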
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| hf download "$MODEL" | ||
|
|
||
| nvidia-smi | ||
|
|
||
| export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 | ||
|
|
||
| # The deepseek-v4-blackwell image bakes CUDA_VISIBLE_DEVICES=4,5,6,7 into its ENV, | ||
| # which masks half of the 8 GPUs Slurm allocates us. Clear it so TP=8 can bind to | ||
| # all ranks. | ||
| unset CUDA_VISIBLE_DEVICES | ||
|
|
||
| # TODO(Cam): the lmsysorg/sglang:deepseek-v4-blackwell image installs sglang | ||
| # editable at /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang. | ||
| # The runner mounts our repo at a non-/workspace path for this image so the editable | ||
| # install stays visible. Paths in this script are $PWD-relative for that reason. | ||
| # Drop the runner conditional once lmsys moves sglang back out of /workspace. | ||
|
|
||
| SERVER_LOG="$PWD/server.log" | ||
| PORT=${PORT:-8888} | ||
|
|
||
| echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL" | ||
|
|
||
| EVAL_CONTEXT_ARGS="" | ||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN" | ||
| fi | ||
|
|
||
| start_gpu_monitor --output "$PWD/gpu_metrics.csv" | ||
|
|
||
| set -x | ||
| PYTHONNOUSERSITE=1 \ | ||
| SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 \ | ||
| SGLANG_OPT_USE_TOPK_V2=1 \ | ||
| SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 \ | ||
| sglang serve \ | ||
| --trust-remote-code \ | ||
| --model-path $MODEL \ | ||
| --tp 8 \ | ||
|
Check warning on line 56 in benchmarks/single_node/dsv4_fp4_b200_mtp.sh
|
||
|
Contributor
🟡 Hard-coded
Extended reasoning...
What the bug is
Both
Every other comparable single-node b200 sglang script parameterizes via
Why this is latent today
In
Why this still warrants a fix in this PR
If a future PR adds a
Addressing the refutation
The refutation argues this is intentional design because:
Step-by-step proof of the latent failure
Suppose someone adds a TP=4 entry to the yaml in a future PR:
dsv4-fp4-b200-sglang:
  ...
  search-space:
  - { tp: 4, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: "mtp" }
Then for that sweep entry:
How to fix
One-line change on line 56 in each file: |
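As a rough illustration of the parameterization this comment asks for, the tensor-parallel flag can be derived from the sweep-provided `TP` value (which `check_env_vars` already requires) instead of a literal. `tp_flag` is a hypothetical helper introduced only for this sketch, and the validation rule is an assumption rather than anything the repository is known to do.

```shell
# Hypothetical helper: emit the tensor-parallel flag from the sweep's TP value,
# rejecting empty, non-numeric, or zero values before the server is launched.
tp_flag() {
    local tp="$1"
    case "$tp" in
        ''|*[!0-9]*|0)
            echo "TP must be a positive integer, got '$tp'" >&2
            return 1
            ;;
    esac
    printf '%s %s\n' '--tp' "$tp"
}

# With a literal "--tp 8", a TP=4 sweep entry would still launch 8 ranks.
# Deriving the flag from the sweep value keeps the server and the entry in agreement.
demo_tp=4
echo "sglang serve ... $(tp_flag "$demo_tp")"
```

In the launch scripts themselves the equivalent change would presumably be a single substitution of the literal with the quoted `TP` variable.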
||
| --moe-runner-backend flashinfer_mxfp4 \ | ||
| --speculative-algo EAGLE \ | ||
|
Check warning on line 58 in benchmarks/single_node/dsv4_fp4_b200_mtp.sh
|
||
|
Contributor
🟡 Both new dsv4 scripts pass
Extended reasoning...
What the bug is
Both new dsv4 launch scripts use a non-canonical flag name:
Every other MTP script in this directory (17 files:
Why it likely works today
Python's
Why it's still worth fixing
Addressing the refutation
A reviewer noted the PR description cites the upstream cookbook recipe (https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4) and suggested the PR author may have copied the abbreviated flag from there. That's possible, but it doesn't change the recommendation: the canonical sglang flag is
Step-by-step proof
Fix
Change both lines to:
Severity
|
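A lint along the following lines could catch the abbreviated spelling before it lands. It is a sketch under two assumptions: that the full flag name is `--speculative-algorithm` (as listed in SGLang's server arguments), and that the abbreviation is only accepted because Python's `argparse` resolves unambiguous prefixes by default. `find_abbreviated_spec_flag` is a hypothetical name.

```shell
# Hypothetical lint: print (with line numbers) every line that uses the
# abbreviated "--speculative-algo" instead of the assumed full spelling
# "--speculative-algorithm". Reads the given files, or stdin when none are given.
find_abbreviated_spec_flag() {
    # The trailing group requires that "algo" is not followed by further
    # flag characters, so "--speculative-algorithm" is not reported.
    grep -nE -e '--speculative-algo([^a-z-]|$)' "$@"
}

# Only the first of these two lines is reported.
printf '%s\n' '--speculative-algo EAGLE' '--speculative-algorithm EAGLE' \
    | find_abbreviated_spec_flag
```

Relying on prefix matching is fragile because a future flag sharing the same prefix would make the abbreviation ambiguous, which is why spelling the flag out in full is the safer choice.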
||
| --speculative-num-steps 3 \ | ||
| --speculative-eagle-topk 1 \ | ||
| --speculative-num-draft-tokens 4 \ | ||
| --chunked-prefill-size 4096 \ | ||
| --disable-flashinfer-autotune \ | ||
| --mem-fraction-static 0.82 \ | ||
| --host 0.0.0.0 \ | ||
| --port $PORT > $SERVER_LOG 2>&1 & | ||
|
Check failure on line 66 in benchmarks/single_node/dsv4_fp4_b200_mtp.sh
|
||
|
Contributor
🔴 EVAL_CONTEXT_ARGS is populated to '--context-length $EVAL_MAX_MODEL_LEN' inside the EVAL_ONLY branch (lines 40-44) but never expanded into the
Extended reasoning...
What the bug is
In
EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi
…but the variable is never referenced again in the file. The
The established pattern this diverges from
Every sibling script that defines
Why existing code does not prevent it
Bash silently tolerates an unused variable; there is no static check that catches
Impact
The end-of-script branch
Step-by-step proof
How to fix
Append
--host 0.0.0.0 \
--port $PORT $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
Apply the same fix in |
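The effect of expanding (or forgetting to expand) `EVAL_CONTEXT_ARGS` can be seen in isolation with a stand-in for the launch line. This is a minimal sketch: `server_args` is a hypothetical function that only prints the arguments the server would receive, and the context length `16384` is an arbitrary illustrative value, not one taken from the repository.

```shell
# Hypothetical stand-in for the tail of the launch command: print the
# arguments that would be passed, given a port and the eval-context string.
server_args() {
    local port="$1" eval_context_args="$2"
    # Left unquoted on purpose: the string holds a flag and its value
    # ("--context-length N") and must split into two words, or expand to
    # nothing at all when it is empty.
    # shellcheck disable=SC2086
    set -- --host 0.0.0.0 --port "$port" $eval_context_args
    printf '%s\n' "$*"
}

# Throughput run: the variable is empty, so no extra flag is passed.
server_args 8888 ""
# Eval run: the context-length flag only reaches the server if it is expanded.
server_args 8888 "--context-length 16384"
```

The unquoted expansion is the design choice worth noting: quoting it would pass an empty argument in throughput runs and a single fused argument in eval runs.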
||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| pip install -q datasets pandas | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts $((CONC * 10)) \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir "$PWD/" \ | ||
| --use-chat-template | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,13 @@ | ||
| - config-keys: | ||
| - dsv4-fp4-b200-sglang | ||
| description: | ||
| - "Add DeepSeek-V4-Pro single-node B200 SGLang benchmark (TP8, EP8, dp-attention)" | ||
| - "Container: lmsysorg/sglang:deepseek-v4-blackwell" | ||
| - "Recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4" | ||
| - "Parallelism and sweep conc ranges match the dsv4-fp4-b200-vllm config" | ||
| - "Prefix caching and speculative decoding disabled for baseline numbers" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1131 | ||
|
|
||
|
Check failure on line 10 in perf-changelog.yaml
|
||
|
Comment on lines +1 to +10
Contributor
🔴 The new perf-changelog.yaml entry for dsv4-fp4-b200-sglang misrepresents what is actually shipped in this PR: it claims "TP8, EP8, dp-attention" and "Prefix caching and speculative decoding disabled", but nvidia-master.yaml has ep:1 (not 8), neither dsv4_fp4_b200.sh nor dsv4_fp4_b200_mtp.sh passes --enable-dp-attention/--dp-size, both scripts enable EAGLE speculative decoding (--speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4), and neither script disables prefix caching (no --disable-radix-cache). The pr-link also still points at #1131 (the predecessor PR) instead of this PR. The text appears to have been copy-pasted from #1131's baseline description and not reconciled with the MTP recipe being shipped here. Please update it to reflect EP=1, no dp-attention, EAGLE/MTP enabled, prefix caching enabled, and the correct PR number, since perf-changelog.yaml is consumed by tooling and readers to interpret the published numbers.
Extended reasoning...
Summary
The perf-changelog.yaml entry added by this PR contradicts the YAML config and benchmark scripts shipped in the same diff on four points, plus has a stale
What the entry claims vs. what is shipped
The new entry (perf-changelog.yaml, top of file):
- config-keys:
    - dsv4-fp4-b200-sglang
  description:
    - "Add DeepSeek-V4-Pro single-node B200 SGLang benchmark (TP8, EP8, dp-attention)"
    - ...
    - "Prefix caching and speculative decoding disabled for baseline numbers"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1131
Claim 1: "EP8".
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 1024, spec-decoding: "mtp" }
EP is 1, not 8.
Claim 2: "dp-attention". Neither
Claim 3: "speculative decoding disabled for baseline numbers". Both scripts pass:
Speculative decoding is enabled, not disabled. The PR title is literally "dsv4 B200 MTP SGLang launch" and the PR description bullet point is "Add EAGLE speculative decoding".
Claim 4: "Prefix caching ... disabled". sglang has prefix caching on by default; disabling it requires
Stale
Why this happened
The PR description says "Based on #1131", and #1131 is the predecessor non-MTP, baseline-numbers submission. The author appears to have copy-pasted #1131's changelog entry verbatim and updated the YAML/scripts to the new MTP recipe without reconciling the changelog text. Every clause that's wrong is correct for #1131's baseline configuration.
Step-by-step proof
Impact
perf-changelog.yaml is consumed by both tooling and humans to interpret what each published benchmark number represents. A reader comparing dsv4-fp4-b200-sglang numbers against a competing config will be misled into thinking this is a TP8/EP8/dp-attention baseline with speculative decoding off, when it is actually TP8/EP1 with EAGLE MTP enabled and prefix caching on, a fundamentally different operating point. No runtime impact (documentation only), so this is on the boundary between "nit" and "normal"; I'm flagging it as
Suggested fix
Update the entry to reflect what ships:
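The mismatch between the changelog text and the launch flags could be checked mechanically. The following is a sketch only: `describe_launch_flags` is a hypothetical helper, and it simply greps a script for the flag names cited in this comment (`--speculative-*`, `--disable-radix-cache`, `--enable-dp-attention`) rather than parsing the command line properly.

```shell
# Hypothetical audit helper: report which features a launch script's flags
# imply, so the result can be compared against the changelog description.
describe_launch_flags() {
    local script="$1"
    if grep -q -e '--speculative-' "$script"; then
        echo "speculative decoding: enabled"
    else
        echo "speculative decoding: disabled"
    fi
    if grep -q -e '--disable-radix-cache' "$script"; then
        echo "prefix caching: disabled"
    else
        echo "prefix caching: enabled (default)"
    fi
    if grep -q -e '--enable-dp-attention' "$script"; then
        echo "dp-attention: enabled"
    else
        echo "dp-attention: not enabled"
    fi
}
```

Invoked as `describe_launch_flags benchmarks/single_node/dsv4_fp4_b200_mtp.sh`, a helper like this would be expected to report speculative decoding as enabled and dp-attention as not enabled for the scripts in this diff, which is the opposite of what the changelog entry states.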
|
||
| - config-keys: | ||
| - dsr1-fp8-h100-dynamo-trt | ||
| - dsr1-fp8-h100-dynamo-sglang | ||
|
|
||
🔴 The non-MTP benchmarks/single_node/dsv4_fp4_b200.sh enables EAGLE speculative decoding (--speculative-algo EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4), contradicting both its name and the convention used by every other non-MTP variant in this directory. Drop the four --speculative-* flags from the non-MTP script so the runner picks up genuinely-disabled spec decoding when a non-MTP YAML entry is added later (matches dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_fp8_b200.sh, etc.). The current YAML only references spec-decoding: mtp, so this is dormant today, but it ships in the PR and the changelog explicitly states "Prefix caching and speculative decoding disabled for baseline numbers," which this script contradicts.
Extended reasoning...
What is broken
benchmarks/single_node/dsv4_fp4_b200.sh (the non-MTP variant added in this PR) launches sglang at lines 58–61 with:
The two new dsv4 scripts dsv4_fp4_b200.sh and dsv4_fp4_b200_mtp.sh are byte-for-byte identical except that the _mtp script appends --use-chat-template to the benchmark client (line 85 of the _mtp script). This pattern strongly suggests the non-MTP file was created by copy-pasting the MTP script and forgetting to strip the spec-decoding block.
Why this contradicts repo convention
Across benchmarks/single_node/ every other framework + GPU pair follows a strict <model>_<precision>_<gpu>.sh (no spec) vs <model>_<precision>_<gpu>_mtp.sh (spec on) split; for example, dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_fp4_b200.sh, and glm5_fp8_b200.sh contain no --speculative-* flags, while their _mtp.sh siblings do. dsv4_fp4_b200.sh is the only non-_mtp script in this directory that enables EAGLE.
How the runner selects scripts
runners/launch_b200-dgxc-slurm.sh (and launch_b200-nb.sh) build the script name with:
So an entry without spec-decoding: mtp resolves to dsv4_fp4_b200.sh (no suffix). The current nvidia-master.yaml only adds entries with spec-decoding: "mtp" for dsv4-fp4-b200-sglang, so the broken script is dormant today, which is why no one will see incorrect numbers right now.
Step-by-step proof of the latent bug
{ tp: 8, ep: 1, conc-start: 4, conc-end: 1024 } (no spec-decoding field).
SPEC_DECODING is empty for that entry, so SPEC_SUFFIX is empty.
benchmarks/single_node/dsv4_fp4_b200.sh.
--speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
Conflict with the PR's own changelog
The perf-changelog.yaml entry added in this PR states for dsv4-fp4-b200-sglang:
That promise is correct for the MTP variant only. The non-MTP script as written breaks this contract the moment any non-MTP YAML entry is added.
Fix
Remove these four lines from benchmarks/single_node/dsv4_fp4_b200.sh (lines 58–61):
After this, dsv4_fp4_b200.sh matches its name and matches the convention established by every other non-MTP variant in the directory. The MTP script (dsv4_fp4_b200_mtp.sh) is correct as-is and should keep the spec flags.
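The script-selection behaviour described under "How the runner selects scripts" can be sketched as follows. This is an illustrative reconstruction, not the runner's actual code (which is not reproduced in this review): `resolve_script_name` is a hypothetical function, and the rule that only the value `mtp` produces an `_mtp` suffix is an assumption inferred from the comment's mention of `SPEC_DECODING` and `SPEC_SUFFIX`.

```shell
# Hypothetical reconstruction of the naming rule: an entry whose spec-decoding
# value is "mtp" selects the _mtp script; any other value selects the plain one.
resolve_script_name() {
    local base="$1" spec_decoding="$2" spec_suffix=""
    if [ "$spec_decoding" = "mtp" ]; then
        spec_suffix="_mtp"
    fi
    printf '%s%s.sh\n' "$base" "$spec_suffix"
}

# Entry with spec-decoding: "mtp" selects the MTP script.
resolve_script_name dsv4_fp4_b200 mtp
# Entry without a spec-decoding field selects the plain script, which is why
# that script should not carry the --speculative-* flags.
resolve_script_name dsv4_fp4_b200 ""
```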