Merged
8 changes: 5 additions & 3 deletions .github/configs/nvidia-master.yaml
@@ -2065,7 +2065,7 @@ qwen3.5-fp4-b200-sglang:
- { tp: 2, ep: 1, conc-start: 4, conc-end: 128 }

qwen3.5-fp4-b200-sglang-mtp:
image: lmsysorg/sglang:nightly-dev-20260402-d7256eb6
image: lmsysorg/sglang:nightly-dev-20260422-de962f32
model: nvidia/Qwen3.5-397B-A17B-NVFP4
model-prefix: qwen3.5
runner: b200
@@ -2077,11 +2077,13 @@ qwen3.5-fp4-b200-sglang-mtp:
- isl: 1024
osl: 1024
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 4, spec-decoding: mtp }
- { tp: 2, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 4, spec-decoding: mtp }
- { tp: 2, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }

glm5-fp8-b200-sglang:
image: lmsysorg/sglang:nightly-dev-cu13-20260317-1eea7448
52 changes: 17 additions & 35 deletions benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh
@@ -20,61 +20,43 @@ nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
SCHEDULER_RECV_INTERVAL=30
else
SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.85
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
CONTEXT_LENGTH="$EVAL_MAX_MODEL_LEN"
fi

if [[ $TP -eq 8 ]]; then
EXTRA_ARGS="--enable-flashinfer-allreduce-fusion"
else
EXTRA_ARGS=""
fi

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--trust-remote-code \
--tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
--quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
--tensor-parallel-size=$TP --data-parallel-size=1 --expert-parallel-size=$EP_SIZE \
--enable-symm-mem \
--disable-radix-cache \
--quantization modelopt_fp4 \
--kv-cache-dtype fp8_e4m3 \
--mamba-ssm-dtype bfloat16 \
--cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
--mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
--context-length $CONTEXT_LENGTH --disable-radix-cache \
--attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
$EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
--tokenizer-worker-num 6 --stream-interval 30 \
--attention-backend trtllm_mha \
--moe-runner-backend flashinfer_trtllm \
--cuda-graph-max-bs $CONC \
--max-running-requests $CONC \
--max-prefill-tokens 16384 \
--chunked-prefill-size 16384 \
--mem-fraction-static 0.8 \
--stream-interval 50 \
--scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 ) \
Contributor
🟡 The new scheduler-recv-interval uses $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 ), so any CONC >= 5 (e.g. CONC=8 in the new TP=2 sweep) gets interval=30. The perf-changelog says this PR aims to "Align server flags with FP4 B200 STP", but the FP4 B200 STP companion (qwen3.5_fp4_b200.sh:32) and every other qwen3.5 *_mtp.sh sibling still use CONC -ge 16 for the 30/10 cutoff. Could you confirm CONC > 4 is intentional (matching qwen3.5_fp8_b200.sh) or change it to -ge 16 to actually match FP4 B200 STP?

Extended reasoning...

What's happening

benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh:52 sets:

--scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )

So the threshold for switching to interval=30 is CONC >= 5. The deleted code in the same file (and every other qwen3.5 *_mtp.sh and the FP4 B200 STP companion) used:

if [[ $CONC -ge 16 ]]; then SCHEDULER_RECV_INTERVAL=30; else SCHEDULER_RECV_INTERVAL=10; fi

i.e. threshold CONC >= 16.

Why this is worth flagging

The perf-changelog entry for this PR says: "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval". But the named reference benchmarks/single_node/qwen3.5_fp4_b200.sh:32 (the FP4 B200 STP script) still uses CONC -ge 16. So the chosen threshold does not match what the PR claims to align with.

What it does match

The new threshold matches benchmarks/single_node/qwen3.5_fp8_b200.sh:51, which was updated in PR #1027 to the CONC -gt 4 pattern. The full launch block in the new MTP script is in fact much closer to qwen3.5_fp8_b200.sh than to qwen3.5_fp4_b200.sh (same --enable-symm-mem, same --max-prefill-tokens 16384, same --stream-interval 50, same --mem-fraction-static 0.8). So the most likely scenario is that the author copy-pasted the launch block from qwen3.5_fp8_b200.sh, not from qwen3.5_fp4_b200.sh.

Concrete impact

In the new TP=2 search space conc-start: 4, conc-end: 128, the swept concurrencies that exist in both this MTP script and the FP4 STP companion are 4, 8, 16, 32, 64, 128. At CONC=8:

  • This MTP script: scheduler-recv-interval = 30
  • FP4 B200 STP (qwen3.5_fp4_b200.sh): scheduler-recv-interval = 10

That's a 3x divergence in scheduler batching at one swept point. CONC>=16 already matches under both rules, and CONC=4 also matches. So only CONC=8 actually diverges among standard sweep points — measurement effect is small but real.

Why I'm filing as nit, not normal

  • It only affects perf data at CONC=8 — this is a tuning knob, not a correctness bug.
  • It's plausibly intentional: the launch block mirrors the recently-updated qwen3.5_fp8_b200.sh, so the author may have deliberately picked the FP8-B200 pattern.

I'm flagging it because the PR description explicitly names FP4 B200 STP as the alignment target, and the chosen threshold does not actually match that target. Easiest fix: either change to -ge 16 to truly mirror qwen3.5_fp4_b200.sh, or update the perf-changelog to say "align with FP8 B200 STP / FP8 B200 SGLang" instead.
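The divergence described above is easy to check in isolation. The sketch below reimplements both threshold rules as standalone functions (the function names are illustrative, not from the repo) and sweeps the standard concurrency points; only CONC=8 produces different intervals under the two rules:

```shell
#!/usr/bin/env bash
# New rule from this PR (and qwen3.5_fp8_b200.sh): interval=30 once CONC > 4.
recv_interval_new() { [[ "$1" -gt 4 ]] && echo 30 || echo 10; }

# Old rule from qwen3.5_fp4_b200.sh and the prior *_mtp.sh scripts: interval=30 once CONC >= 16.
recv_interval_stp() { if [[ "$1" -ge 16 ]]; then echo 30; else echo 10; fi; }

# Sweep the concurrencies shared by both search spaces and print both intervals.
for conc in 4 8 16 32 64 128; do
  printf 'CONC=%-4s new=%-3s stp=%s\n' "$conc" \
    "$(recv_interval_new "$conc")" "$(recv_interval_stp "$conc")"
done
```

Running this shows matching intervals everywhere except CONC=8, where the new rule yields 30 and the STP rule yields 10, which is the single divergent sweep point flagged above.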

--tokenizer-worker-num 6 \
--tokenizer-path $MODEL \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
> $SERVER_LOG 2>&1 &
--context-length $CONTEXT_LENGTH > $SERVER_LOG 2>&1 &

SERVER_PID=$!

10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -2221,3 +2221,13 @@
- "Update the TensorRT-LLM DeepSeek-V4-Pro image to ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-9aa3715"
- "Enable TRTLLM fused MHC by default with the DeepSeek-V4 feature image"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1270

- config-keys:
- qwen3.5-fp4-b200-sglang-mtp
description:
- "Update image to lmsysorg/sglang:nightly-dev-20260422-de962f32"
- "Add tp:2 ep:1 conc 4-128 search-space for 1k1k and 8k1k"
- "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval"
- "Add MTP flags: SGLANG_ENABLE_SPEC_V2=1, EAGLE speculative decoding (steps=3, topk=1, draft=4)"
Contributor
🟡 The new perf-changelog.yaml entry for qwen3.5-fp4-b200-sglang-mtp uses the literal placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX for its pr-link. Every other entry in this file points to a real PR — please replace XXX with this PR's number (1257) before merge so the changelog stays traceable.

Extended reasoning...

What the bug is

The new entry appended to perf-changelog.yaml (the last block in the file, covering qwen3.5-fp4-b200-sglang-mtp) ends with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

XXX is a literal placeholder, not a substituted value. The actual PR number is 1257, as shown in the PR metadata.

Why existing entries don't have this issue

Every other entry in perf-changelog.yaml (well over 140 of them) uses a real PR number — e.g. the immediately preceding entry uses pull/1027, and the others span pull/95 through pull/1223. The placeholder XXX is unique to this newly added block and is clearly a stub the author forgot to fill in before pushing.

How it manifests / impact

This is a documentation/metadata defect, not a runtime bug. The benchmark scripts and nvidia-master.yaml config changes work regardless of what is written in perf-changelog.yaml. However, this file is the project's authoritative log mapping config-key changes to the PRs that introduced them; any tooling, reviewer, or future bisecting effort that follows the pr-link for this entry will hit GitHub's 404 page for /pull/XXX (since XXX is not a valid PR number) instead of landing on PR #1257.

Step-by-step proof

  1. Open perf-changelog.yaml and scroll to the bottom — the new entry added by this PR is the last block.
  2. The block's config-keys lists qwen3.5-fp4-b200-sglang-mtp (the new key being introduced in .github/configs/nvidia-master.yaml in this same PR).
  3. Its pr-link field reads https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  4. The PR metadata shows this PR is number 1257, so the link should read https://github.com/SemiAnalysisAI/InferenceX/pull/1257.
  5. Compare against the entry directly above it (also for qwen3.5-fp8-b200-sglang), which correctly resolves to pull/1027.

How to fix

Replace XXX with 1257 in the new entry, e.g.:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1257

This is the only change needed; nothing else in the file or the rest of the diff needs to be touched. Severity is nit because it doesn't affect benchmark execution, but it should be fixed before merge to maintain the file's traceability invariant.

- "Reduce prefill/chunked from 32768 to 16384"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1257