Update DeepSeek V4 Pro FP4 GB300 disaggregated SGLang benchmarks #1295
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Often, failures are just flakes, and simply re-running the failed jobs will fix them; if failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review & get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
@Oseltamivir or @cquil11 can you help review & validate this?
Force-pushed from 05052ba to 3a192ab
/sweep

@ch-wan Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25469744396
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25469891963
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25469899213
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25470089427
Force-pushed from 5609be7 to 211c0bc
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25472300450
Force-pushed from 211c0bc to 37967c5
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25472347173
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25472386488
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25472555374
Force-pushed from 47ebbd8 to c4054eb
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25473604582
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25483080867
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25507127672
@claude confirm that all the top-level parameters changed in the master YAMLs match those set in the recipe files (i.e. TP, EP, num workers, etc.)
@claude add a PR description for posterity
Claude finished @cquil11's task in 1m 24s —— View job

Verifying Master YAML ↔ Recipe File Parameter Consistency

Results

I compared all 6 nvidia-master.yaml entries against their corresponding recipe files.

1. disagg-gb300-1p1d-dep4-dep16-5-c1024 — ✅ All match
| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 1 | prefill_workers: 1 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 16 | tensor-parallel-size: 16 |
| Decode EP | 16 | expert-parallel-size: 16 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [1024] | concurrencies: "1024" |
| Total nodes | 5 (1P+4D) | prefill_nodes: 1, decode_nodes: 4 ✅ |
2. disagg-gb300-4p1d-dep4-dep16-8-c1024 — ✅ All match
| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 4 | prefill_workers: 4 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 16 | tensor-parallel-size: 16 |
| Decode EP | 16 | expert-parallel-size: 16 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [1024] | concurrencies: "1024" |
| Total nodes | 8 (4P+4D) | prefill_nodes: 4, decode_nodes: 4 ✅ |
3. disagg-gb300-8p1d-dep4-dep16-12-c4096 — ✅ All match
| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 8 | prefill_workers: 8 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 16 | tensor-parallel-size: 16 |
| Decode EP | 16 | expert-parallel-size: 16 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [4096] | concurrencies: "4096" |
| Total nodes | 12 (8P+4D) | prefill_nodes: 8, decode_nodes: 4 ✅ |
4. disagg-gb300-1p1d-tp4-tp4-2-c1 — ✅ All match
| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 1 | prefill_workers: 1 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 1 | expert-parallel-size: 1 |
| Prefill dp-attn | false | no enable-dp-attention (absent = false) |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 4 | tensor-parallel-size: 4 |
| Decode EP | 1 | expert-parallel-size: 1 |
| Decode dp-attn | false | no enable-dp-attention (absent = false) |
| Concurrency | [1] | concurrencies: "1" |
| Total nodes | 2 (1P+1D) | prefill_nodes: 1, decode_nodes: 1 ✅ |
5. disagg-gb300-10p1d-dep4-dep16-14-c8192 — ✅ All match
| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 10 | prefill_workers: 10 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 16 | tensor-parallel-size: 16 |
| Decode EP | 16 | expert-parallel-size: 16 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [8192] | concurrencies: "8192" |
| Total nodes | 14 (10P+4D) | prefill_nodes: 10, decode_nodes: 4 ✅ |
6. disagg-gb300-12p1d-dep4-dep12-15-c21504 — ✅ All match
| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 12 | prefill_workers: 12 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 12 | tensor-parallel-size: 12 |
| Decode EP | 12 | expert-parallel-size: 12 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [21504] | concurrencies: "21504" |
| Total nodes | 15 (12P+3D) | prefill_nodes: 12, decode_nodes: 3 ✅ |
Summary
All 6 entries are fully consistent. Every top-level parameter in nvidia-master.yaml (num-worker, TP, EP, dp-attn, concurrency) matches the corresponding values in each recipe file's resources and sglang_config sections. Node counts encoded in the recipe filenames also check out (prefill_nodes + decode_nodes = total). The container image (lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev) is consistent across the master YAML and all recipe files.
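For reference, a minimal sketch of how one master-YAML entry lines up with its recipe file, using entry 1's values from the tables above. The key nesting on both sides is an assumption for illustration; only the parameter names and values come from the check:

```yaml
# Hypothetical layout: nesting is illustrative, values are entry 1's.
# nvidia-master.yaml (one search-space entry)
disagg-gb300-1p1d-dep4-dep16-5-c1024:
  prefill: { num-worker: 1, tp: 4, ep: 4, dp-attn: true }
  decode: { num-worker: 1, tp: 16, ep: 16, dp-attn: true }
  concurrency: [1024]
---
# matching recipe file (resources + sglang_config sections)
resources:
  prefill_nodes: 1
  decode_nodes: 4
  prefill_workers: 1
  decode_workers: 1
sglang_config:
  prefill:
    tensor-parallel-size: 4
    expert-parallel-size: 4
    enable-dp-attention: true
  decode:
    tensor-parallel-size: 16
    expert-parallel-size: 16
    enable-dp-attention: true
concurrencies: "1024"
```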
Claude finished @cquil11's task in 1m 1s —— View job

Adding PR Description

A PR description has been added with a full summary of all changes: the search-space overhaul (WideEP TP=16 decode, scaled concurrencies), the recipe file renames, and the srt-slurm pin switch to upstream.
Re-derived each MTP recipe by cloning the closest-topology working base recipe from main (verified to pass CI in PR #1295) and only adding the four `speculative-*` keys for EAGLE/MTP plus topology adjustments (prefill_workers, decode_nodes, gpus_per_decode, decode TP/DP/EP, concurrencies). All env vars, sbatch_directives, container, dynamo hash, frontend args, deepep config, and other tuning come from the working base verbatim, discarding the elvischenv structure that used different env vars (PRECOMPILE vs FAST_WARMUP) and `mxfp4` precision.

Recipe -> base mapping:
- disagg-low-latency-1p1d-tp4-tp4 -> disagg-gb300-1p1d-tp4-tp4-2-c1
- disagg-mid-curve-1p1d-dep4-dep16 -> disagg-gb300-1p1d-dep4-dep16-5-c1024 (+conc=256)
- disagg-mid-curve-1p1d-dep4-dep8 -> wideep base, decode TP=8, conc=256
- disagg-mid-curve-2p1d-dep4-dep8 -> wideep base, 2P, decode TP=8, conc=512
- disagg-mid-curve-4p1d-dep4-dep8 -> wideep base, 4P, decode TP=8, conc=1024
- disagg-low-latency-1p6d-dep4-tp4 -> hybrid: wideep prefill + 1p1d-tp4-tp4 decode

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
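For context, a sketch of what those four `speculative-*` keys could look like inside a recipe's sglang_config. The key names are SGLang's standard EAGLE speculative-decoding flags; the placement under `decode` and the specific values are assumptions for illustration, not taken from this PR:

```yaml
# Illustrative sketch only: SGLang's usual EAGLE/MTP speculative-decoding keys.
# The values below are placeholders, not the ones tuned in these recipes.
sglang_config:
  decode:
    speculative-algorithm: EAGLE      # MTP draft heads served via the EAGLE path
    speculative-num-steps: 3          # draft steps per verification round
    speculative-eagle-topk: 1         # candidate tokens kept per draft step
    speculative-num-draft-tokens: 4   # total draft tokens verified per round
```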
Summary

Overhaul the DeepSeek-V4-Pro FP4 GB300 disaggregated SGLang benchmark configurations to use the WideEP TP=16 decode topology across most concurrency points, scale up concurrency targets, switch to the upstream NVIDIA/srt-slurm main branch, and re-enable lm-eval scoring.

Key changes

Search-space overhaul (nvidia-master.yaml)

- Switched the container image from lmsysorg/sglang:deepseek-v4-grace-blackwell to lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev

Recipe files (benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/)

- Renamed conc{N}.yaml files to the descriptive disagg-gb300-{P}p1d-{prefill_topo}-{decode_topo}-{nodes}-c{conc}.yaml naming convention (see the filename breakdown after this list)
- conc1024.yaml -> disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml for the new 8p1d topology
- Disabled the radix cache (disable-radix-cache: true) across all recipe files
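As a worked example of the new naming convention, here is how one filename decodes (values match entry 3 of the consistency check above):

```yaml
# disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml
#   8p1d  -> 8 prefill workers, 1 decode worker
#   dep4  -> prefill topology: TP=4, EP=4, DP attention enabled
#   dep16 -> decode topology: TP=16, EP=16, DP attention enabled
#   12    -> 12 nodes total (8 prefill + 4 decode)
#   c4096 -> concurrency 4096
```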
srt-slurm pin (runners/launch_gb300-cw.sh)

- Switched from the fzyzcjy/srt-slurm fork (pinned at 4249d168, which added parallel random prompt generation but lacked the lm-eval orchestrator) to the upstream NVIDIA/srt-slurm main branch

Eval scoring (generate_sweep_configs.py)

- Removed the gb300-cw/dynamo-sglang eval skip guard (added in PR #1157, Day 0 DeepSeek V4 Pro FP4 GB300 disaggregated SGLang benchmarks) now that the srt-slurm pin includes the lm-eval orchestrator path

Perf changelog (perf-changelog.yaml)

Topology summary