Closed
18 commits
069547e
Add dsv4-fp4-b200-sglang single-node config
cquil11 Apr 24, 2026
4c4cb70
Switch dsv4-fp4-b200-sglang to Pro model, match vllm parallelism
cquil11 Apr 24, 2026
33e2d28
Match DSV4 Pro SGLang recipe literally; port HF cache path
cquil11 Apr 24, 2026
ef48416
fix: use 'sglang serve' CLI, not python -m sglang.launch_server
cquil11 Apr 24, 2026
3cec2be
fix: mount repo at /ix for deepseek-v4-blackwell image
cquil11 Apr 24, 2026
b7a7e29
fix: reinstall sglang from PyPI to work around masked editable install
cquil11 Apr 24, 2026
1dc5646
fix: uninstall editable sglang before reinstalling from PyPI
cquil11 Apr 24, 2026
b29d8ec
fix: mount repo at /ix for deepseek-v4-blackwell; drop pip workaround
cquil11 Apr 24, 2026
cc0b95d
fix: unset baked-in CUDA_VISIBLE_DEVICES for deepseek-v4-blackwell image
cquil11 Apr 24, 2026
59182b9
fix: apply same /ix mount fix to launch_b200-nb.sh
cquil11 Apr 24, 2026
d538a4a
Drop --container-name arg from launch_b200-nb.sh
cquil11 Apr 24, 2026
c8b48b5
Update dsv4 B200 SGLang launch: sglang serve + EAGLE speculative deco…
yhyang201 Apr 24, 2026
6ee2f21
Add spec-decoding: mtp to dsv4-fp4-b200-sglang config
yhyang201 Apr 24, 2026
1dd4db6
Add dsv4_fp4_b200_mtp.sh for spec-decoding benchmarks
yhyang201 Apr 24, 2026
0ab8925
Merge remote-tracking branch 'origin/main' into worktree-chore+dsv4-s…
cquil11 Apr 24, 2026
ed1aeda
Restore SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 / SGLANG_OPT_USE_TOPK_V2 …
cquil11 Apr 24, 2026
9b7cb76
Split dsv4-fp4-b200-sglang-mtp search-space to match baseline plumbing
cquil11 Apr 24, 2026
33955e1
Merge branch 'main' into chore/dsv4-sgl-mtp-b200
cquil11 Apr 25, 2026
18 changes: 18 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1669,6 +1669,24 @@ dsr1-fp4-b200-sglang:
- { tp: 4, ep: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, ep: 8, conc-start: 4, conc-end: 16 }

dsv4-fp4-b200-sglang:
  image: lmsysorg/sglang:deepseek-v4-blackwell
  model: deepseek-ai/DeepSeek-V4-Pro
  model-prefix: dsv4
  runner: b200
  precision: fp4
  framework: sglang
  multinode: false
  seq-len-configs:
  - isl: 1024
    osl: 1024
    search-space:
    - { tp: 8, ep: 1, conc-start: 4, conc-end: 1024, spec-decoding: "mtp" }
  - isl: 8192
    osl: 1024
    search-space:
    - { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: "mtp" }

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1
# does not have a B300-specific recipe, so this config reuses the existing DSR1 FP4
# B200 SGLang recipe as-is until B300-specific tuning is available.
92 changes: 92 additions & 0 deletions benchmarks/single_node/dsv4_fp4_b200.sh
@@ -0,0 +1,92 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

nvidia-smi

export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0

# The deepseek-v4-blackwell image bakes CUDA_VISIBLE_DEVICES=4,5,6,7 into its ENV,
# which masks half of the 8 GPUs Slurm allocates us. Clear it so TP=8 can bind to
# all ranks.
unset CUDA_VISIBLE_DEVICES

# TODO(Cam): the lmsysorg/sglang:deepseek-v4-blackwell image installs sglang
# editable at /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang.
# The runner mounts our repo at a non-/workspace path for this image so the editable
# install stays visible. Paths in this script are $PWD-relative for that reason.
# Drop the runner conditional once lmsys moves sglang back out of /workspace.

SERVER_LOG="$PWD/server.log"
PORT=${PORT:-8888}

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor --output "$PWD/gpu_metrics.csv"

set -x
PYTHONNOUSERSITE=1 \
SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 \
SGLANG_OPT_USE_TOPK_V2=1 \
SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 \
sglang serve \
--trust-remote-code \
--model-path $MODEL \
--tp 8 \
--moe-runner-backend flashinfer_mxfp4 \
--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \

Check failure on line 61 in benchmarks/single_node/dsv4_fp4_b200.sh

Claude / Claude Code Review

Non-MTP dsv4_fp4_b200.sh enables EAGLE spec decoding

🔴 The non-MTP benchmarks/single_node/dsv4_fp4_b200.sh enables EAGLE speculative decoding (--speculative-algo EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4), contradicting both its name and the convention used by every other non-MTP variant in this directory. Drop the four --speculative-* flags from the non-MTP script so the runner picks up genuinely-disabled spec decoding when a non-MTP YAML entry is added later (matches dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_fp8_b200.sh, etc.). The current YAML only references spec-decoding: mtp, so this is dormant today, but it ships in the PR and the changelog explicitly states "Prefix caching and speculative decoding disabled for baseline numbers," which this script contradicts.

What is broken

benchmarks/single_node/dsv4_fp4_b200.sh (the non-MTP variant added in this PR) launches sglang at lines 58–61 with:

--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \

The two new dsv4 scripts dsv4_fp4_b200.sh and dsv4_fp4_b200_mtp.sh are byte-for-byte identical except that the _mtp script appends --use-chat-template to the benchmark client (line 85 of the _mtp script). This pattern strongly suggests the non-MTP file was created by copy-pasting the MTP script and forgetting to strip the spec-decoding block.

Why this contradicts repo convention

Across benchmarks/single_node/ every other framework + GPU pair follows a strict <model>_<precision>_<gpu>.sh (no spec) vs <model>_<precision>_<gpu>_mtp.sh (spec on) split — e.g. dsr1_fp4_b200.sh, dsr1_fp8_b200.sh, glm5_fp4_b200.sh, and glm5_fp8_b200.sh contain no --speculative-* flags, while their _mtp.sh siblings do. dsv4_fp4_b200.sh is the only non-_mtp script in this directory that enables EAGLE.

How the runner selects scripts

runners/launch_b200-dgxc-slurm.sh (and launch_b200-nb.sh) build the script name with:

SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
... benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

So an entry without spec-decoding: mtp resolves to dsv4_fp4_b200.sh (no suffix). The current nvidia-master.yaml only adds entries with spec-decoding: "mtp" for dsv4-fp4-b200-sglang, so the broken script is dormant today — that's why no one will see incorrect numbers right now.
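
The resolution can be reproduced outside the runner. This is a minimal sketch that reuses the `SPEC_SUFFIX` and `FRAMEWORK_SUFFIX` expressions from the runner; the `resolve_script` wrapper and the sample `EXP_NAME` value `dsv4_1k1k` are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the runner's naming logic quoted in this review.
# Arguments: EXP_NAME PRECISION FRAMEWORK SPEC_DECODING
resolve_script() {
  local exp_name="$1" precision="$2" framework="$3" spec_decoding="$4"
  local framework_suffix spec_suffix
  framework_suffix=$([[ "$framework" == "trt" ]] && printf '_trt' || printf '')
  spec_suffix=$([[ "$spec_decoding" == "mtp" ]] && printf '_mtp' || printf '')
  # ${exp_name%%_*} keeps everything before the first underscore (the model prefix).
  printf 'benchmarks/single_node/%s_%s_b200%s%s.sh\n' \
    "${exp_name%%_*}" "$precision" "$framework_suffix" "$spec_suffix"
}

resolve_script "dsv4_1k1k" fp4 sglang mtp   # entry with spec-decoding: mtp
resolve_script "dsv4_1k1k" fp4 sglang ""    # entry without a spec-decoding field
```

The second call resolves to the unsuffixed script, which is the path a future baseline entry would take.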

Step-by-step proof of the latent bug

  1. A future PR adds a baseline (non-MTP) entry to the dsv4-fp4-b200-sglang config, e.g. { tp: 8, ep: 1, conc-start: 4, conc-end: 1024 } (no spec-decoding field).
  2. SPEC_DECODING is empty for that entry, so SPEC_SUFFIX is empty.
  3. The runner invokes benchmarks/single_node/dsv4_fp4_b200.sh.
  4. That script still launches sglang with --speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
  5. The benchmark records numbers from EAGLE-accelerated decoding, and the user labels them as a "baseline (no spec decoding)" run. The result is silently wrong perf numbers.

Conflict with the PR's own changelog

The perf-changelog.yaml entry added in this PR states for dsv4-fp4-b200-sglang:

"Prefix caching and speculative decoding disabled for baseline numbers"

That statement can hold only for a non-MTP script that passes no spec-decoding flags; the MTP variant enables EAGLE by design. The non-MTP script as written breaks this contract the moment any non-MTP YAML entry is added.

Fix

Remove these four lines from benchmarks/single_node/dsv4_fp4_b200.sh (lines 58–61):

--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \

After this, dsv4_fp4_b200.sh matches its name and matches the convention established by every other non-MTP variant in the directory. The MTP script (dsv4_fp4_b200_mtp.sh) is correct as-is and should keep the spec flags.

--chunked-prefill-size 4096 \
--disable-flashinfer-autotune \
--mem-fraction-static 0.82 \
--host 0.0.0.0 \
--port $PORT > $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts $((CONC * 10)) \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir "$PWD/"

if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

stop_gpu_monitor
set +x
93 changes: 93 additions & 0 deletions benchmarks/single_node/dsv4_fp4_b200_mtp.sh
@@ -0,0 +1,93 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

nvidia-smi

export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0

# The deepseek-v4-blackwell image bakes CUDA_VISIBLE_DEVICES=4,5,6,7 into its ENV,
# which masks half of the 8 GPUs Slurm allocates us. Clear it so TP=8 can bind to
# all ranks.
unset CUDA_VISIBLE_DEVICES

# TODO(Cam): the lmsysorg/sglang:deepseek-v4-blackwell image installs sglang
# editable at /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang.
# The runner mounts our repo at a non-/workspace path for this image so the editable
# install stays visible. Paths in this script are $PWD-relative for that reason.
# Drop the runner conditional once lmsys moves sglang back out of /workspace.

SERVER_LOG="$PWD/server.log"
PORT=${PORT:-8888}

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor --output "$PWD/gpu_metrics.csv"

set -x
PYTHONNOUSERSITE=1 \
SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 \
SGLANG_OPT_USE_TOPK_V2=1 \
SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 \
sglang serve \
--trust-remote-code \
--model-path $MODEL \
--tp 8 \

Check warning on line 56 in benchmarks/single_node/dsv4_fp4_b200_mtp.sh

Claude / Claude Code Review

Hard-coded --tp 8 ignores $TP env var

🟡 Hard-coded --tp 8 on line 56 ignores the $TP env var that check_env_vars enforces and that the script echoes for logging on line 38; same issue at dsv4_fp4_b200.sh:56. Latent today (yaml only has tp:8 entries) but inconsistent with every sibling *_b200*.sh script in this directory which all parameterize via --tensor-parallel-size=$TP or --tp "$TP" — worth fixing now while the script is being added. Trivial fix: change --tp 8 to --tp "$TP".

What the bug is

Both benchmarks/single_node/dsv4_fp4_b200.sh:56 and benchmarks/single_node/dsv4_fp4_b200_mtp.sh:56 invoke sglang serve with a literal --tp 8, even though the same script:

  1. Calls check_env_vars … TP … at the top, enforcing that $TP is set by the runner.
  2. Echoes "TP: $TP, …" for logging on line 38.
  3. Then ignores $TP and pins TP=8 on line 56.

Every other comparable single-node b200 sglang script parameterizes via $TP. grep in the same directory:

benchmarks/single_node/dsr1_fp4_b200.sh:44:--tensor-parallel-size=$TP …
benchmarks/single_node/dsr1_fp8_b200.sh:76:--tensor-parallel-size=$TP …
benchmarks/single_node/dsr1_fp8_b200_mtp.sh:72:--tensor-parallel-size=$TP \
benchmarks/single_node/dsv4_fp4_b200.sh:56:  --tp 8 \
benchmarks/single_node/dsv4_fp4_b200_mtp.sh:56:  --tp 8 \

Why this is latent today

In .github/configs/nvidia-master.yaml, the new dsv4-fp4-b200-sglang block only defines { tp: 8, … } entries for both seq-len configs — so today $TP always equals 8 and there is no functional divergence. This is why I'm filing as nit, not normal.

Why this still warrants a fix in this PR

If a future PR adds a { tp: 4, … } entry to this config (consistent with dsr1-fp4-b200-sglang, which already has both tp: 4 and tp: 8 entries), the failure mode is silent and confusing:

  1. The runner sweep loop in runners/launch_b200-dgxc-slurm.sh:285 reserves --gres=gpu:$TP — so slurm only allocates 4 GPUs for the job.
  2. The script then runs unset CUDA_VISIBLE_DEVICES (per the comment on line 25, to clear the image's baked-in mask) and invokes sglang serve --tp 8.
  3. sglang would try to fan out to 8 ranks against a 4-GPU allocation. Worse, the unset CUDA_VISIBLE_DEVICES makes the symptom less obvious because the original 4-GPU mask is being cleared right before the launch.

Addressing the refutation

The refutation argues this is intentional design because:

  • "Comment says TP=8": The comment on lines 23-25 explains why unset CUDA_VISIBLE_DEVICES is needed (the image bakes in a 4-GPU mask) — it documents an image-specific quirk, not a hard TP=8 design constraint. The comment says "so TP=8 can bind to all ranks", not "this script only supports TP=8".
  • "dsr1_fp8_b200_mtp.sh also hardcodes TP=8": This is factually incorrect. dsr1_fp8_b200_mtp.sh uses --tensor-parallel-size=$TP on line 72 (not hardcoded), with a defensive guard if [[ $TP -ne 8 ]]; then exit 1; fi on line 29. That's the right pattern: parameterize via $TP, validate the input. The dsv4 scripts do neither.
  • "Yaml only has tp:8": Acknowledged; that's why this is filed as nit, not normal.

Step-by-step proof of the latent failure

Suppose someone adds a TP=4 entry to the yaml in a future PR:

dsv4-fp4-b200-sglang:
  ...
  search-space:
    - { tp: 4, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: "mtp" }

Then for that sweep entry:

  1. Runner sets TP=4 and runs salloc --gres=gpu:4 … (line 285 in launch_b200-dgxc-slurm.sh).
  2. Slurm allocates a 4-GPU job. CUDA_VISIBLE_DEVICES at this point reflects the 4 allocated GPUs.
  3. dsv4_fp4_b200_mtp.sh runs, echoes "TP: 4, …".
  4. Script runs unset CUDA_VISIBLE_DEVICES (clears the visible mask).
  5. Script invokes sglang serve --tp 8 — tries to use 8 ranks. Crashes or hangs at startup. The user sees a TP=8 error in a "TP: 4" log context, and has to reverse-engineer why.

How to fix

One-line change on line 56 in each file: --tp 8 → --tp "$TP". Optionally also add an if [[ $TP -ne 8 ]]; then …; fi guard like dsr1_fp8_b200_mtp.sh:29 if the author wants to keep TP=8-only enforcement explicit. Either way matches the established pattern in the rest of the repo.
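
A minimal sketch of the "parameterize, then validate" pattern described above. The `build_tp_args` helper is hypothetical (it is not part of benchmark_lib.sh), and the TP=8-only restriction is an assumption carried over from the dsr1_fp8_b200_mtp.sh guard mentioned in this review.

```shell
#!/usr/bin/env bash
# Sketch: validate TP, then build the flag from the variable instead of a literal.
# build_tp_args is a hypothetical helper for illustration only.
build_tp_args() {
  local tp="$1"
  if [[ "$tp" -ne 8 ]]; then
    echo "This recipe is only validated for TP=8 (got TP=$tp)" >&2
    return 1
  fi
  # The leading -- stops printf from parsing the format string as an option.
  printf -- '--tp %s\n' "$tp"
}

build_tp_args 8                          # prints the flag derived from TP
build_tp_args 4 || echo "rejected TP=4"  # guard fails loudly instead of launching
```

With the guard in place, a TP=4 sweep entry fails before the server starts, and the error message names the actual mismatch.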

--moe-runner-backend flashinfer_mxfp4 \
--speculative-algo EAGLE \

Check warning on line 58 in benchmarks/single_node/dsv4_fp4_b200_mtp.sh

Claude / Claude Code Review

--speculative-algo flag breaks convention (should be --speculative-algorithm)

🟡 Both new dsv4 scripts pass --speculative-algo EAGLE (dsv4_fp4_b200.sh:58 and dsv4_fp4_b200_mtp.sh:58), while all 17 sibling MTP scripts in benchmarks/single_node/ (dsr1_*_mtp.sh, glm5_*_mtp.sh, qwen3.5_*_mtp.sh) use the canonical --speculative-algorithm. The abbreviated form likely works today via argparse prefix-matching, but it's a clear convention break and is fragile (would silently break if sglang ever adds another --speculative-algo* flag, making the prefix ambiguous, or switches sglang serve from argparse to click). Recommend changing both lines to --speculative-algorithm EAGLE for consistency.

What the bug is

Both new dsv4 launch scripts use a non-canonical flag name:

  • benchmarks/single_node/dsv4_fp4_b200.sh:58: --speculative-algo EAGLE
  • benchmarks/single_node/dsv4_fp4_b200_mtp.sh:58: --speculative-algo EAGLE

Every other MTP script in this directory (17 files: dsr1_fp8_b200_mtp.sh, dsr1_fp8_b300_mtp.sh, glm5_fp4_b200_mtp.sh, glm5_fp4_b300_mtp.sh, glm5_fp8_b200_mtp.sh, glm5_fp8_b300_mtp.sh, glm5_fp8_mi355x_mtp.sh, qwen3.5_bf16_b200_mtp.sh, qwen3.5_bf16_b300_mtp.sh, qwen3.5_bf16_mi355x_mtp.sh, qwen3.5_fp4_b200_mtp.sh, qwen3.5_fp4_b300_mtp.sh, qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp8_b300_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_mi355x_mtp.sh) uses the full --speculative-algorithm EAGLE. Upstream sglang's server_args.py defines the canonical flag as --speculative-algorithm; --speculative-algo is not declared as an alias.

Why it likely works today

Python's argparse with allow_abbrev=True (the default) accepts any unambiguous prefix of a registered long option. The other speculative flags in the same launch (--speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens) do not share the --speculative-algo prefix, so today --speculative-algo resolves uniquely to --speculative-algorithm. So in practice the launch most likely works on the deepseek-v4-blackwell image.

Why it's still worth fixing

  1. Fragile: if sglang ever adds another flag starting with --speculative-algo (e.g., --speculative-algo-version), argparse will exit with an ambiguous-option error and the launch will hard-fail. This is exactly the kind of breakage that surfaces only after an image bump.
  2. Not guaranteed by the parser: allow_abbrev was added in Python 3.5 and defaults to True, but a parser can be constructed with allow_abbrev=False, and a future click-based rewrite of sglang serve (the new 0.5.x entry point; these are the FIRST scripts in the repo to use sglang serve rather than python3 -m sglang.launch_server) would not honor the abbreviation. Click does not do prefix matching at all.
  3. Consistency: 17 sibling scripts use the full name. Matching the convention costs nothing and removes the question of whether this is intentional.

Addressing the refutation

A reviewer noted the PR description cites the upstream cookbook recipe (https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4) and suggested the PR author may have copied the abbreviated flag from there. That's possible, but it doesn't change the recommendation: the canonical sglang flag is --speculative-algorithm, every existing repo script uses it, and the fix is a one-token-per-line edit (-algo → -algorithm). Even if the cookbook also uses the short form, the canonical name is unambiguously safer and consistent with the rest of this repo.

Step-by-step proof

  1. Run grep -n 'speculative-alg' benchmarks/single_node/*.sh — only dsv4_fp4_b200.sh and dsv4_fp4_b200_mtp.sh use --speculative-algo; the other 17 use --speculative-algorithm.
  2. Inspect upstream sglang/srt/server_args.py — argparse registers --speculative-algorithm (no --speculative-algo alias).
  3. Today: argparse with default allow_abbrev=True resolves --speculative-algo → --speculative-algorithm because no other registered long option starts with --speculative-algo.
  4. Hypothetical near-future scenario: sglang adds --speculative-algo-version (a plausible sub-knob name). Now --speculative-algo is an ambiguous prefix and argparse exits with error: ambiguous option: --speculative-algo could match --speculative-algorithm, --speculative-algo-version. The dsv4 launch fails at startup; every other MTP script in the repo continues to work because they pass the full name.
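
The grep in step 1 can be turned into a small check. This is a sketch only: `find_abbreviated_flag` is a hypothetical helper, and the two sample files are stand-ins created in a temporary directory, not the real scripts.

```shell
#!/usr/bin/env bash
# Sketch: report lines that pass the abbreviated --speculative-algo instead of
# the full --speculative-algorithm. find_abbreviated_flag is a hypothetical helper.
find_abbreviated_flag() {
  # Match --speculative-algo only when followed by whitespace, '=' or end of
  # line, so the full --speculative-algorithm spelling does not match.
  grep -nE -- '--speculative-algo([[:space:]=]|$)' "$@"
}

tmpdir=$(mktemp -d)
printf '%s\n' 'sglang serve \' '  --speculative-algo EAGLE \' > "$tmpdir/short.sh"
printf '%s\n' 'sglang serve \' '  --speculative-algorithm EAGLE \' > "$tmpdir/full.sh"

find_abbreviated_flag "$tmpdir"/*.sh   # reports only the line in short.sh
rm -rf "$tmpdir"
```

grep exits non-zero when nothing matches, so the helper can double as a pass/fail check when run over benchmarks/single_node/*.sh.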

Fix

Change both lines to:

  --speculative-algorithm EAGLE \

Severity

nit — works today via argparse prefix-matching, but inconsistent with 17 sibling scripts and fragile to upstream additions. Worth fixing while the PR is open.

--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--chunked-prefill-size 4096 \
--disable-flashinfer-autotune \
--mem-fraction-static 0.82 \
--host 0.0.0.0 \
--port $PORT > $SERVER_LOG 2>&1 &

Check failure on line 66 in benchmarks/single_node/dsv4_fp4_b200_mtp.sh

Claude / Claude Code Review

EVAL_CONTEXT_ARGS computed but never passed to sglang serve

🔴 EVAL_CONTEXT_ARGS is populated to '--context-length $EVAL_MAX_MODEL_LEN' inside the EVAL_ONLY branch (lines 40-44) but never expanded into the sglang serve invocation (lines 49-66), so the server boots with the default model context. When this benchmark runs in EVAL_ONLY mode (e.g. via the multi-node lm-eval flow added in #1000/#1094/#1120), long-context evals will be silently truncated or fail. Append $EVAL_CONTEXT_ARGS to the sglang serve command (right before > $SERVER_LOG); the same fix is needed in dsv4_fp4_b200.sh.

What the bug is

In benchmarks/single_node/dsv4_fp4_b200_mtp.sh (and identically in dsv4_fp4_b200.sh), the EVAL_ONLY branch declares and populates EVAL_CONTEXT_ARGS:

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

…but the variable is never referenced again in the file. The sglang serve invocation that follows ends with --port $PORT > $SERVER_LOG 2>&1 & with no $EVAL_CONTEXT_ARGS expansion. Net effect: the server boots with whatever the model's default max context is, regardless of what setup_eval_context decided.

The established pattern this diverges from

Every sibling script that defines EVAL_CONTEXT_ARGS also appends it to the launch line. Examples in this repo: dsr1_fp8_b200.sh:80, dsr1_fp4_b200.sh:48, glm5_fp8_b200.sh:56, dsr1_fp8_h200.sh:44, dsr1_fp8_b200_mtp.sh:93, glm5_fp4_b200_mtp.sh:61. benchmarks/benchmark_lib.sh:648 documents the convention: "Scripts then wire $EVAL_MAX_MODEL_LEN into whichever server variable they need." The new dsv4 scripts skip step two.

Why existing code does not prevent it

Bash silently tolerates an unused variable — there is no static check that catches EVAL_CONTEXT_ARGS being set-but-never-expanded. setup_eval_context does its job (computing EVAL_MAX_MODEL_LEN); the breakage is purely at the consumer.

Impact

The end-of-script branch if [ "${RUN_EVAL}" = "true" ]; then run_eval --framework lm-eval --port "$PORT" (lines 87-90) shows the author intended this benchmark to be eval-runnable. Per perf-changelog.yaml, recent PRs (#1000, #1094, #1120) added a multi-node lm-eval flow that exercises EVAL_ONLY, and the new dsv4-fp4-b200-sglang config in nvidia-master.yaml will be picked up by future evals-only entries that follow this pattern. When that happens, lm-eval tasks with prompts longer than the model's default context will either be truncated (silent accuracy degradation) or error with "prompt is longer than maximum context length". This is dead code today, latent eval breakage tomorrow.

Step-by-step proof

  1. Operator sets EVAL_ONLY=true and runs the runner script, which dispatches dsv4_fp4_b200_mtp.sh.
  2. Line 40 sets EVAL_CONTEXT_ARGS="".
  3. Lines 41-44 enter the if [ "${EVAL_ONLY}" = "true" ] branch, call setup_eval_context (which sets EVAL_MAX_MODEL_LEN to e.g. 8192 based on lm-eval task requirements), and assign EVAL_CONTEXT_ARGS="--context-length 8192".
  4. Lines 49-66 launch sglang serve … --port $PORT > $SERVER_LOG 2>&1 &. grep EVAL_CONTEXT_ARGS against this block returns no match — the variable's expanded value never reaches the server's argv.
  5. SGLang reads max_position_embeddings from the model config and uses that (or its own internal default) as the effective context length.
  6. lm-eval submits a request whose token count exceeds the default → server returns 4xx with a context-length error, or silently truncates depending on backend behavior. The eval result is wrong.

How to fix

Append $EVAL_CONTEXT_ARGS to the sglang serve invocation, right before > $SERVER_LOG:

  --host 0.0.0.0 \
  --port $PORT $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

Apply the same fix in dsv4_fp4_b200.sh. When EVAL_ONLY != true, EVAL_CONTEXT_ARGS stays empty and the launch is unchanged — zero risk to the non-eval benchmark path.
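
The "zero risk" claim rests on how bash expands an empty, unquoted variable. A minimal sketch of that behavior follows; `print_argv` is a hypothetical stand-in for sglang serve that only echoes its arguments, and the 8192 value is an assumed example of what setup_eval_context might produce.

```shell
#!/usr/bin/env bash
# Sketch: an empty EVAL_CONTEXT_ARGS contributes nothing to the command line.
# print_argv is a hypothetical stand-in for `sglang serve`; it brackets each argument.
print_argv() { printf '[%s]' "$@"; printf '\n'; }

PORT=8888
EVAL_MAX_MODEL_LEN=8192   # assumed value; setup_eval_context would set this

EVAL_CONTEXT_ARGS=""
# Left unquoted on purpose so the string word-splits into separate arguments.
print_argv --port "$PORT" $EVAL_CONTEXT_ARGS
# prints: [--port][8888]

EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
print_argv --port "$PORT" $EVAL_CONTEXT_ARGS
# prints: [--port][8888][--context-length][8192]
```

The unquoted expansion matches how the sibling scripts pass the variable; a bash array would be the stricter alternative if the arguments could ever contain spaces.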


SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts $((CONC * 10)) \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir "$PWD/" \
--use-chat-template

if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

stop_gpu_monitor
set +x
10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -1,3 +1,13 @@
- config-keys:
    - dsv4-fp4-b200-sglang
  description:
    - "Add DeepSeek-V4-Pro single-node B200 SGLang benchmark (TP8, EP8, dp-attention)"
    - "Container: lmsysorg/sglang:deepseek-v4-blackwell"
    - "Recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4"
    - "Parallelism and sweep conc ranges match the dsv4-fp4-b200-vllm config"
    - "Prefix caching and speculative decoding disabled for baseline numbers"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1131

Check failure on line 10 in perf-changelog.yaml

Claude / Claude Code Review

perf-changelog description contradicts the shipped yaml and scripts

Comment on lines +1 to +10

🔴 The new perf-changelog.yaml entry for dsv4-fp4-b200-sglang misrepresents what is actually shipped in this PR: it claims "TP8, EP8, dp-attention" and "Prefix caching and speculative decoding disabled", but nvidia-master.yaml has ep:1 (not 8), neither dsv4_fp4_b200.sh nor dsv4_fp4_b200_mtp.sh passes --enable-dp-attention/--dp-size, both scripts enable EAGLE speculative decoding (--speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4), and neither script disables prefix caching (no --disable-radix-cache). The pr-link also still points at #1131 (the predecessor PR) instead of this PR. The text appears to have been copy-pasted from #1131's baseline description and not reconciled with the MTP recipe being shipped here — please update it to reflect EP=1, no dp-attention, EAGLE/MTP enabled, prefix caching enabled, and the correct PR number, since perf-changelog.yaml is consumed by tooling and readers to interpret the published numbers.

Summary

The perf-changelog.yaml entry added by this PR contradicts the YAML config and benchmark scripts shipped in the same diff on four points, plus has a stale pr-link.

What the entry claims vs. what is shipped

The new entry (perf-changelog.yaml, top of file):

- config-keys:
    - dsv4-fp4-b200-sglang
  description:
    - "Add DeepSeek-V4-Pro single-node B200 SGLang benchmark (TP8, EP8, dp-attention)"
    - ...
    - "Prefix caching and speculative decoding disabled for baseline numbers"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1131

Claim 1: "EP8". .github/configs/nvidia-master.yaml shipped in the same diff has, for both seq-len configs:

search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 1024, spec-decoding: "mtp" }

EP is 1, not 8.

Claim 2: "dp-attention". Neither benchmarks/single_node/dsv4_fp4_b200.sh nor benchmarks/single_node/dsv4_fp4_b200_mtp.sh passes --enable-dp-attention or --dp-size. dp-attention is not enabled.

Claim 3: "speculative decoding disabled for baseline numbers". Both scripts pass:

--speculative-algo EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \

Speculative decoding is enabled, not disabled. The PR title is literally "dsv4 B200 MTP SGLang launch" and the PR description bullet point is "Add EAGLE speculative decoding".

Claim 4: "Prefix caching ... disabled". sglang has prefix caching on by default; disabling it requires --disable-radix-cache, which neither script passes. So prefix caching is on.

Stale pr-link. The entry points at #1131, but this is PR #1145 (the description says "Based on #1131"). The link should be updated to this PR.

Why this happened

The PR description says "Based on #1131", and #1131 is the predecessor non-MTP, baseline-numbers submission. The author appears to have copy-pasted #1131's changelog entry verbatim and updated the YAML/scripts to the new MTP recipe without reconciling the changelog text. Every clause that's wrong is correct for #1131's baseline configuration.

Step-by-step proof

  1. Open .github/configs/nvidia-master.yaml at the new dsv4-fp4-b200-sglang block. For both isl: 1024 / osl: 1024 and isl: 8192 / osl: 1024, search-space is { tp: 8, ep: 1, ... } — EP=1, contradicting "EP8".
  2. Grep both new scripts for enable-dp-attention or dp-size: zero matches. Contradicts "dp-attention".
  3. Grep both new scripts for speculative-algo: each script has --speculative-algo EAGLE plus the three EAGLE tuning flags. Contradicts "speculative decoding disabled".
  4. Grep both new scripts for disable-radix-cache: zero matches, and SGLang enables prefix caching by default. Contradicts "Prefix caching ... disabled".
  5. The pr-link field is https://github.com/SemiAnalysisAI/InferenceX/pull/1131 but the PR being reviewed is #1145.
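
The grep checks in steps 2 to 4 can be scripted. The following is a minimal sketch, assuming it is run from the repository root. flag_status is a hypothetical helper name, not something in the PR; the script paths and flag names are the ones cited above.

```shell
# Hypothetical helper (not part of the PR): report whether a launch script
# mentions a given CLI flag. $1 = script path, $2 = flag name without dashes.
flag_status() {
  # -e keeps grep from parsing the leading "--" as one of its own options.
  if grep -q -e "--$2" "$1"; then
    echo present
  else
    echo absent
  fi
}

# Check the two scripts named in the review; skip any that are missing locally.
for script in benchmarks/single_node/dsv4_fp4_b200.sh \
              benchmarks/single_node/dsv4_fp4_b200_mtp.sh; do
  [ -f "$script" ] || continue
  for flag in enable-dp-attention dp-size speculative-algo disable-radix-cache; do
    printf '%s --%s: %s\n' "$script" "$flag" "$(flag_status "$script" "$flag")"
  done
done
```

The match is a plain substring search, so a longer flag that starts with the same text would also count as present. If the findings above hold, only speculative-algo should report present for both scripts.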

Impact

perf-changelog.yaml is consumed by both tooling and humans to interpret what each published benchmark number represents. A reader comparing dsv4-fp4-b200-sglang numbers against a competing config will be misled into thinking this is a TP8/EP8/dp-attention baseline with speculative decoding and prefix caching off, when it is actually TP8/EP1 with EAGLE MTP enabled and prefix caching on, which is a fundamentally different operating point. There is no runtime impact (documentation only), so this sits on the boundary between "nit" and "normal"; I'm flagging it as normal because the entry actively misleads about the test conditions of all numbers shipped under this config key.

Suggested fix

Update the entry to reflect what ships:

- config-keys:
- dsr1-fp8-h100-dynamo-trt
- dsr1-fp8-h100-dynamo-sglang
18 changes: 14 additions & 4 deletions runners/launch_b200-dgxc-slurm.sh
@@ -249,13 +249,23 @@ EOF

else

-HF_HUB_CACHE_MOUNT="/scratch/fsw/models"
-export MODEL="$HF_HUB_CACHE_MOUNT/${MODEL#*/}"
+HF_HUB_CACHE_MOUNT="/scratch/fsw/gharunners/hf-hub-cache"
SQUASH_FILE="/home/sa-shared/containers/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
LOCK_FILE="${SQUASH_FILE}.lock"

+# TODO(Cam): lmsysorg/sglang:deepseek-v4-blackwell installs sglang editable at
+# /workspace/sglang/python (prior sglang tags used /sgl-workspace/sglang), so
+# the default $GITHUB_WORKSPACE:/workspace/ bind-mount masks the install and
+# breaks `import sglang`. Mount this one image at /ix instead; drop the
+# conditional once the image stops installing editable under /workspace.
+if [[ "$IMAGE" == *deepseek-v4-blackwell* ]]; then
+    CONTAINER_MOUNT_DIR=/ix
+else
+    CONTAINER_MOUNT_DIR=/workspace
+fi

salloc --partition=$SLURM_PARTITION --account=$SLURM_ACCOUNT --gres=gpu:$TP --exclusive --time=180 --no-shell --job-name="$RUNNER_NAME"
JOB_ID=$(squeue --name="$RUNNER_NAME" -u "$USER" -h -o %A | head -n1)

@@ -276,9 +286,9 @@ else

srun --jobid=$JOB_ID \
--container-image=$SQUASH_FILE \
---container-mounts=$GITHUB_WORKSPACE:/workspace/,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE_MOUNT \
+--container-mounts=$GITHUB_WORKSPACE:$CONTAINER_MOUNT_DIR,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE \
--no-container-mount-home \
---container-workdir=/workspace/ \
+--container-workdir=$CONTAINER_MOUNT_DIR \
--no-container-entrypoint --export=ALL,PORT=8888 \
bash benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
fi
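
The final line of the srun command above assembles the benchmark script path from environment variables. The following is a minimal sketch of that expansion as a standalone POSIX function. resolve_benchmark_script is a hypothetical name, and the EXP_NAME value dsv4_1k1k in the usage line is an assumed example, not a value taken from the PR.

```shell
# Hypothetical wrapper around the path expression used by the launcher.
# $1 = EXP_NAME, $2 = PRECISION, $3 = FRAMEWORK, $4 = SPEC_DECODING
# The body runs in a subshell, so the helper variables stay local.
resolve_benchmark_script() (
  framework_suffix=''
  spec_suffix=''
  # Same conditions as FRAMEWORK_SUFFIX and SPEC_SUFFIX in the launcher.
  [ "$3" = "trt" ] && framework_suffix='_trt'
  [ "$4" = "mtp" ] && spec_suffix='_mtp'
  # ${1%%_*} keeps only the part of EXP_NAME before its first underscore.
  printf 'benchmarks/single_node/%s_%s_b200%s%s.sh\n' \
    "${1%%_*}" "$2" "$framework_suffix" "$spec_suffix"
)

# Assumed example value for EXP_NAME.
resolve_benchmark_script dsv4_1k1k fp4 sglang mtp
# prints benchmarks/single_node/dsv4_fp4_b200_mtp.sh
```

With SPEC_DECODING set to anything other than mtp, the same inputs resolve to dsv4_fp4_b200.sh, so the two scripts discussed in the review are selected purely by that one variable.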
16 changes: 13 additions & 3 deletions runners/launch_b200-nb.sh
@@ -7,14 +7,24 @@ SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

UCX_NET_DEVICES=eth0

+# TODO(Cam): lmsysorg/sglang:deepseek-v4-blackwell installs sglang editable at
+# /workspace/sglang/python (prior sglang tags used /sgl-workspace/sglang), so
+# the default $GITHUB_WORKSPACE:/workspace/ bind-mount masks the install and
+# breaks `import sglang`. Mount this one image at /ix instead; drop the
+# conditional once the image stops installing editable under /workspace.
+if [[ "$IMAGE" == *deepseek-v4-blackwell* ]]; then
+    CONTAINER_MOUNT_DIR=/ix
+else
+    CONTAINER_MOUNT_DIR=/workspace
+fi

set -x
srun --partition=$PARTITION --gres=gpu:$TP --exclusive --job-name="$RUNNER_NAME" \
--container-image=$IMAGE \
---container-name=$(echo "$IMAGE" | sed 's/[\/:@#]/_/g')-${USER} \
---container-mounts=$GITHUB_WORKSPACE:/workspace/,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE \
+--container-mounts=$GITHUB_WORKSPACE:$CONTAINER_MOUNT_DIR,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE \
--no-container-mount-home \
--container-remap-root \
--container-writable \
---container-workdir=/workspace/ \
+--container-workdir=$CONTAINER_MOUNT_DIR \
--no-container-entrypoint --export=ALL,PORT=8888,UCX_NET_DEVICES=$UCX_NET_DEVICES \
bash benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
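
Both launchers choose the container mount point from the image name. The sketch below isolates that decision as a POSIX case statement so it can be checked without Slurm. container_mount_dir is a hypothetical function name, and the second image reference in the usage lines is an assumed placeholder.

```shell
# Hypothetical helper mirroring the launcher's conditional.
# $1 = container image reference
container_mount_dir() {
  case "$1" in
    # Images matching deepseek-v4-blackwell install sglang editable under
    # /workspace, so the repo is mounted at /ix to avoid masking that install.
    *deepseek-v4-blackwell*) echo /ix ;;
    *) echo /workspace ;;
  esac
}

container_mount_dir lmsysorg/sglang:deepseek-v4-blackwell   # prints /ix
container_mount_dir example/image:some-tag                  # prints /workspace
```

Keeping the choice in one place like this would also avoid duplicating the conditional across launch_b200-dgxc-slurm.sh and launch_b200-nb.sh, though whether that refactor is worthwhile is a judgment call for the PR author.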