[NV] qwen3.5 fp4 b200 sglang mtp #1257
Changes from all commits: 7f8c16b, 40aab28, 1a1ee81, e4508ab, 82e1eae, 0e41941
perf-changelog.yaml
@@ -2221,3 +2221,13 @@
- "Update the TensorRT-LLM DeepSeek-V4-Pro image to ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-9aa3715"
- "Enable TRTLLM fused MHC by default with the DeepSeek-V4 feature image"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1270

- config-keys:
- qwen3.5-fp4-b200-sglang-mtp
description:
- "Update image to lmsysorg/sglang:nightly-dev-20260422-de962f32"
- "Add tp:2 ep:1 conc 4-128 search-space for 1k1k and 8k1k"
- "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval"
- "Add MTP flags: SGLANG_ENABLE_SPEC_V2=1, EAGLE speculative decoding (steps=3, topk=1, draft=4)"
Contributor
🟡 The new perf-changelog.yaml entry for qwen3.5-fp4-b200-sglang-mtp uses the literal placeholder `XXX` instead of this PR's number in its pr-link.
Extended reasoning...
What the bug is
The new entry appended to perf-changelog.yaml ends with `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX`.
Why existing entries don't have this issue
Every other entry in …
How it manifests / impact
This is a documentation/metadata defect, not a runtime bug. The benchmark scripts and …
Step-by-step proof
…
How to fix
Replace the placeholder line with `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1257`. This is the only change needed; nothing else in the file or the rest of the diff needs to be touched. Severity is nit because it doesn't affect benchmark execution, but it should be fixed before merge to maintain the file's traceability invariant.
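A placeholder like this is easy to catch mechanically before merge. The sketch below is a hypothetical helper (`check_pr_links` is not part of InferenceX) and assumes, as in the diff above, that each `pr-link:` sits on its own line of perf-changelog.yaml. It flags any link whose path after `/pull/` does not start with a digit.

```shell
# Hypothetical pre-merge check, not part of the InferenceX repo.
# Prints offending lines (with line numbers) and returns 1 if any pr-link
# still uses a non-numeric placeholder such as pull/XXX.
# Returns 0 when every pr-link is numeric, and 2 if the file is unreadable.
check_pr_links() {
  local file="$1"
  if [[ ! -r "$file" ]]; then
    echo "cannot read $file" >&2
    return 2
  fi
  # -n: prefix matches with line numbers; -E: extended regex.
  # "/pull/" followed by a non-digit means the PR number was never filled in.
  if grep -nE 'pr-link:.*/pull/[^0-9]' "$file"; then
    return 1
  fi
  return 0
}
```

Typical use would be `check_pr_links perf-changelog.yaml` in a pre-commit hook or CI step; a non-zero status fails the step and the printed line numbers point at the entries to fix.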
- "Reduce prefill/chunked from 32768 to 16384"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1257
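To map the changelog bullets onto concrete server arguments, here is a hedged sketch of how they could be assembled for an SGLang launch. It is not the contents of qwen3.5_fp4_b200_mtp.sh: the helper name `build_mtp_server_args`, its positional parameters, and the flag order are assumptions, and reading "prefill/chunked" as `--max-prefill-tokens` plus `--chunked-prefill-size` is also an assumption. The remaining values come from this changelog entry and from the launch block described in the review of this script, including the `CONC -gt 4` interval rule that the review questions.

```shell
# Hypothetical helper (sketch only, not the real benchmark script).
# Prints one SGLang server argument per line so callers can load them
# into an array without word-splitting surprises.
# Usage: build_mtp_server_args CONC [TP] [EP]
build_mtp_server_args() {
  local conc="$1" tp="${2:-2}" ep="${3:-1}"
  local recv_interval
  # Same expression as the reviewed script: CONC > 4 -> 30, otherwise 10.
  recv_interval="$( [[ $conc -gt 4 ]] && echo 30 || echo 10 )"

  # Parallelism, symmetric memory, scheduler, and prefill settings.
  # --chunked-prefill-size is assumed to be the "chunked" half of the
  # "Reduce prefill/chunked from 32768 to 16384" bullet.
  # The last four flags are EAGLE speculative decoding: steps=3, topk=1, draft=4.
  local args=(
    --tp "$tp"
    --expert-parallel-size "$ep"
    --enable-symm-mem
    --scheduler-recv-interval "$recv_interval"
    --max-prefill-tokens 16384
    --chunked-prefill-size 16384
    --stream-interval 50
    --mem-fraction-static 0.8
    --speculative-algorithm EAGLE
    --speculative-num-steps 3
    --speculative-eagle-topk 1
    --speculative-num-draft-tokens 4
  )
  printf '%s\n' "${args[@]}"
}
```

A caller could then run `mapfile -t args < <(build_mtp_server_args "$CONC")` followed by `SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server --model-path "$MODEL" "${args[@]}"`, where `$MODEL` is a placeholder for the checkpoint path. Emitting one argument per line keeps multi-token values intact when the list is read back into an array.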
🟡 The new `scheduler-recv-interval` uses `$( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )`, so any CONC >= 5 (e.g. CONC=8 in the new TP=2 sweep) gets interval=30. The perf-changelog says this PR aims to "Align server flags with FP4 B200 STP", but the FP4 B200 STP companion (qwen3.5_fp4_b200.sh:32) and every other qwen3.5 `*_mtp.sh` sibling still use `CONC -ge 16` for the 30/10 cutoff. Could you confirm that CONC > 4 is intentional (matching qwen3.5_fp8_b200.sh), or change it to `-ge 16` to actually match FP4 B200 STP?
Extended reasoning...
What's happening
`benchmarks/single_node/qwen3.5_fp4_b200_mtp.sh:52` sets:
`--scheduler-recv-interval $( [[ $CONC -gt 4 ]] && echo 30 || echo 10 )`
So the threshold for switching to interval=30 is CONC >= 5. The deleted code in the same file (and every other qwen3.5 `*_mtp.sh` plus the FP4 B200 STP companion) tested `$CONC -ge 16` for the same 30/10 choice, i.e. a threshold of CONC >= 16.
Why this is worth flagging
The perf-changelog entry for this PR says: "Align server flags with FP4 B200 STP: --enable-symm-mem, --expert-parallel-size, dynamic scheduler-recv-interval". But the named reference, `benchmarks/single_node/qwen3.5_fp4_b200.sh:32` (the FP4 B200 STP script), still uses `CONC -ge 16`. So the chosen threshold does not match what the PR claims to align with.
What it does match
The new threshold matches `benchmarks/single_node/qwen3.5_fp8_b200.sh:51`, which was updated in PR #1027 to the `CONC -gt 4` pattern. The full launch block in the new MTP script is in fact much closer to qwen3.5_fp8_b200.sh than to qwen3.5_fp4_b200.sh (same `--enable-symm-mem`, `--max-prefill-tokens 16384`, `--stream-interval 50`, and `--mem-fraction-static 0.8`). So the most likely explanation is that the launch block was copied from qwen3.5_fp8_b200.sh rather than from qwen3.5_fp4_b200.sh.
Concrete impact
In the new TP=2 search space (`conc-start: 4`, `conc-end: 128`), the swept concurrencies common to this MTP script and the FP4 STP companion are 4, 8, 16, 32, 64, and 128. At CONC=8:
- this MTP script (`CONC -gt 4`): `scheduler-recv-interval = 30`
- FP4 B200 STP (`CONC -ge 16`): `scheduler-recv-interval = 10`

That's a 3x divergence in scheduler batching at one swept point. CONC >= 16 matches under both rules, and so does CONC=4, so CONC=8 is the only standard sweep point that diverges. The measurement effect is small but real.
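The comparison can be checked mechanically. The sketch below evaluates both cutoffs, written in the same `[[ ... ]] && echo 30 || echo 10` form the scripts use, at each of the six sweep points listed above. The function names are hypothetical and exist only for this illustration.

```shell
# Hypothetical helpers mirroring the two scheduler-recv-interval rules.
interval_gt4()  { [[ $1 -gt 4 ]]  && echo 30 || echo 10; }   # new MTP script
interval_ge16() { [[ $1 -ge 16 ]] && echo 30 || echo 10; }   # FP4 B200 STP

# Print both intervals for every swept concurrency (4..128, doubling)
# and mark the points where the two rules disagree.
compare_recv_intervals() {
  local conc a b
  for conc in 4 8 16 32 64 128; do
    a="$(interval_gt4 "$conc")"
    b="$(interval_ge16 "$conc")"
    if [[ "$a" != "$b" ]]; then
      echo "CONC=$conc: mtp=$a stp=$b (differs)"
    else
      echo "CONC=$conc: mtp=$a stp=$b"
    fi
  done
}
```

Running `compare_recv_intervals` marks only the CONC=8 line as differing: CONC=4 falls below both cutoffs, and every point from 16 upward is above both.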
Why I'm filing as nit, not normal
I'm flagging it because the PR description explicitly names FP4 B200 STP as the alignment target, and the chosen threshold does not actually match that target. Easiest fix: either change the condition to `-ge 16` to truly mirror qwen3.5_fp4_b200.sh, or update the perf-changelog to say "align with FP8 B200 STP / FP8 B200 SGLang" instead.