
Update DeepSeek V4 Pro FP4 GB300 disaggregated SGLang benchmarks #1295

Merged

cquil11 merged 9 commits into main from sglang-disagg-gb300-0506 on May 7, 2026
Conversation

ch-wan (Collaborator) commented May 7, 2026

Summary

Overhaul the DeepSeek-V4-Pro FP4 GB300 disaggregated SGLang benchmark configurations to use WideEP TP=16 decode topology across most concurrency points, scale up concurrency targets, switch to the upstream NVIDIA/srt-slurm main branch, and re-enable lm-eval scoring.

Key changes

Search-space overhaul (nvidia-master.yaml)

  • Switch decode workers from mixed TP=8/EP=8 and TP=16/EP=16 to WideEP TP=16/EP=16 across all high-concurrency points (TP=12/EP=12 at the 21504 max-concurrency point)
  • Scale concurrency targets up significantly:
    • 1p1d: 512 → 1024 (5 nodes)
    • 4p1d: 512 → 1024 (8 nodes, was 1p1d with TP=8 decode)
    • 8p1d: new at conc 4096 (12 nodes)
    • 10p1d: 2048 → 8192 (14 nodes)
    • 12p1d: 16384 → 21504 (15 nodes, decode TP=12)
    • 1p1d TP=4 baseline at conc=1 retained
  • Update image from lmsysorg/sglang:deepseek-v4-grace-blackwell to lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev
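A search-space entry of the shape described above might look like the following sketch. The field names (num-worker, tp, ep, dp-attn, concurrency) follow the parameters discussed in this PR, but the exact nvidia-master.yaml layout is an assumption:

```yaml
# Hypothetical sketch of one dsv4-fp4-gb300-dynamo-sglang entry; the real
# nvidia-master.yaml schema may differ, only the values come from this PR.
- name: disagg-gb300-8p1d-dep4-dep16-12-c4096
  image: lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev
  prefill:
    num-worker: 8
    tp: 4
    ep: 4
    dp-attn: true
  decode:
    num-worker: 1
    tp: 16
    ep: 16
    dp-attn: true
  concurrency: [4096]
```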

Recipe files (benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/)

  • Rename all per-concurrency recipe files from conc{N}.yaml to descriptive disagg-gb300-{P}p1d-{prefill_topo}-{decode_topo}-{nodes}-c{conc}.yaml naming convention
  • Delete the now-unused conc1024.yaml
  • Add new disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml for the new 8p1d topology
  • Update TP/EP/resource counts in all recipes to match the new master YAML search-space
  • Disable radix cache (disable-radix-cache: true) across all recipe files
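For illustration, a renamed recipe file might look like the sketch below, built from the keys the review later in this thread references (prefill_workers, decode_nodes, tensor-parallel-size, disable-radix-cache, and so on). The grouping into resources and sglang_config sections, and which keys apply to prefill versus decode, are assumptions:

```yaml
# Hypothetical sketch of disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml; key names
# come from this PR's review comments, but the real schema may differ.
resources:
  prefill_workers: 1
  prefill_nodes: 1
  decode_workers: 1
  decode_nodes: 4
sglang_config:                 # decode-side settings in this sketch
  tensor-parallel-size: 16     # WideEP TP=16 decode
  expert-parallel-size: 16
  enable-dp-attention: true
  disable-radix-cache: true    # disabled across all recipe files in this PR
concurrencies: "1024"
```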

srt-slurm pin (runners/launch_gb300-cw.sh)

  • Switch from the fzyzcjy/srt-slurm fork (pinned at 4249d168, which added parallel random prompt generation but lacked the lm-eval orchestrator) to upstream NVIDIA/srt-slurm main branch
  • This unblocks lm-eval scoring for the GB300 SGLang disagg configs

Eval scoring (generate_sweep_configs.py)

  • Re-enable lm-eval scoring for the GB300 SGLang disagg configs (previously blocked by the srt-slurm fork pin)
Perf changelog (perf-changelog.yaml)

  • Add entry documenting the search-space overhaul, recipe rename, and eval re-enablement

Topology summary

| Config | Prefill | Decode | Nodes | Concurrency |
|---|---|---|---|---|
| 1p1d-tp4-tp4 | 1×TP4 | 1×TP4 | 2 | 1 |
| 1p1d-dep4-dep16 | 1×DP-EP4 | 1×DP-EP16 | 5 | 1,024 |
| 4p1d-dep4-dep16 | 4×DP-EP4 | 1×DP-EP16 | 8 | 1,024 |
| 8p1d-dep4-dep16 | 8×DP-EP4 | 1×DP-EP16 | 12 | 4,096 |
| 10p1d-dep4-dep16 | 10×DP-EP4 | 1×DP-EP16 | 14 | 8,192 |
| 12p1d-dep4-dep12 | 12×DP-EP4 | 1×DP-EP12 | 15 | 21,504 |
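The node counts in the table are internally consistent if each GB300 node contributes 4 GPUs, a figure inferred from the table itself (one DP-EP4 prefill worker per node; an EP16 decode worker spanning 4 nodes). A quick sketch of that arithmetic:

```python
# Sanity-check the topology table's node counts. GPUS_PER_NODE = 4 is an
# inference from the table (EP4 prefill = 1 node, EP16 decode = 4 nodes).
GPUS_PER_NODE = 4

def total_nodes(prefill_workers: int, prefill_tp: int, decode_tp: int) -> int:
    """Total nodes = prefill workers x nodes per prefill worker + decode nodes."""
    prefill_nodes = prefill_workers * (prefill_tp // GPUS_PER_NODE)
    decode_nodes = decode_tp // GPUS_PER_NODE  # single decode worker
    return prefill_nodes + decode_nodes

# (prefill workers, decode TP/EP width, expected total nodes) from the table
for workers, decode_tp, expected in [
    (1, 4, 2),     # 1p1d-tp4-tp4
    (1, 16, 5),    # 1p1d-dep4-dep16
    (4, 16, 8),    # 4p1d-dep4-dep16
    (8, 16, 12),   # 8p1d-dep4-dep16
    (10, 16, 14),  # 10p1d-dep4-dep16
    (12, 12, 15),  # 12p1d-dep4-dep12
]:
    assert total_nodes(workers, 4, decode_tp) == expected
```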

ch-wan requested a review from a team on May 7, 2026 00:42

github-actions bot (Contributor) commented May 7, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

functionstackx (Contributor) commented:

@Oseltamivir or @cquil11, can you help review and validate this?

ch-wan force-pushed the sglang-disagg-gb300-0506 branch from 05052ba to 3a192ab on May 7, 2026 00:55
ch-wan (Collaborator, Author) commented May 7, 2026

/sweep

github-actions bot (Contributor) commented May 7, 2026

@ch-wan Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25469744396
Command: ``
Pinned ref: 3a192ab
Approval: not required (trusted collaborator).

ch-wan force-pushed the sglang-disagg-gb300-0506 branch from 5609be7 to 211c0bc on May 7, 2026 02:17
ch-wan force-pushed the sglang-disagg-gb300-0506 branch from 211c0bc to 37967c5 on May 7, 2026 02:18
ch-wan force-pushed the sglang-disagg-gb300-0506 branch from 47ebbd8 to c4054eb on May 7, 2026 03:00
cquil11 (Collaborator) commented May 7, 2026

@claude confirm that all the top-level parameters changed in the master YAMLs match those set in the recipe files (i.e. TP, EP, num workers, etc.)

cquil11 (Collaborator) commented May 7, 2026

@claude add a PR description for posterity

Klaud-Cold (Collaborator) commented May 7, 2026

Claude finished @cquil11's task in 1m 24s (View job)


Verifying Master YAML ↔ Recipe File Parameter Consistency

  • Read master YAML changes (nvidia-master.yaml)
  • Read each recipe file and extract top-level parameters
  • Cross-check TP, EP, num workers, and other key parameters
  • Report findings

Results

I compared all 6 dsv4-fp4-gb300-dynamo-sglang search-space entries in nvidia-master.yaml against their corresponding recipe YAML files. Parameters checked: num-worker, TP, EP, dp-attn, concurrency, image, and node counts.

1. disagg-gb300-1p1d-dep4-dep16-5-c1024 — ✅ All match

| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 1 | prefill_workers: 1 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 16 | tensor-parallel-size: 16 |
| Decode EP | 16 | expert-parallel-size: 16 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [1024] | concurrencies: "1024" |
| Total nodes | 5 (1P+4D) | prefill_nodes: 1, decode_nodes: 4 |

2. disagg-gb300-4p1d-dep4-dep16-8-c1024 — ✅ All match

| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 4 | prefill_workers: 4 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 16 | tensor-parallel-size: 16 |
| Decode EP | 16 | expert-parallel-size: 16 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [1024] | concurrencies: "1024" |
| Total nodes | 8 (4P+4D) | prefill_nodes: 4, decode_nodes: 4 |

3. disagg-gb300-8p1d-dep4-dep16-12-c4096 — ✅ All match

| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 8 | prefill_workers: 8 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 16 | tensor-parallel-size: 16 |
| Decode EP | 16 | expert-parallel-size: 16 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [4096] | concurrencies: "4096" |
| Total nodes | 12 (8P+4D) | prefill_nodes: 8, decode_nodes: 4 |

4. disagg-gb300-1p1d-tp4-tp4-2-c1 — ✅ All match

| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 1 | prefill_workers: 1 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 1 | expert-parallel-size: 1 |
| Prefill dp-attn | false | no enable-dp-attention (absent = false) |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 4 | tensor-parallel-size: 4 |
| Decode EP | 1 | expert-parallel-size: 1 |
| Decode dp-attn | false | no enable-dp-attention (absent = false) |
| Concurrency | [1] | concurrencies: "1" |
| Total nodes | 2 (1P+1D) | prefill_nodes: 1, decode_nodes: 1 |

5. disagg-gb300-10p1d-dep4-dep16-14-c8192 — ✅ All match

| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 10 | prefill_workers: 10 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 16 | tensor-parallel-size: 16 |
| Decode EP | 16 | expert-parallel-size: 16 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [8192] | concurrencies: "8192" |
| Total nodes | 14 (10P+4D) | prefill_nodes: 10, decode_nodes: 4 |

6. disagg-gb300-12p1d-dep4-dep12-15-c21504 — ✅ All match

| Parameter | Master YAML | Recipe File |
|---|---|---|
| Prefill num-worker | 12 | prefill_workers: 12 |
| Prefill TP | 4 | tensor-parallel-size: 4 |
| Prefill EP | 4 | expert-parallel-size: 4 |
| Prefill dp-attn | true | enable-dp-attention: true |
| Decode num-worker | 1 | decode_workers: 1 |
| Decode TP | 12 | tensor-parallel-size: 12 |
| Decode EP | 12 | expert-parallel-size: 12 |
| Decode dp-attn | true | enable-dp-attention: true |
| Concurrency | [21504] | concurrencies: "21504" |
| Total nodes | 15 (12P+3D) | prefill_nodes: 12, decode_nodes: 3 |

Summary

All 6 entries are fully consistent. Every top-level parameter in nvidia-master.yaml (num-worker, TP, EP, dp-attn, concurrency) matches the corresponding values in each recipe file's resources and sglang_config sections. Node counts encoded in the recipe filenames also check out (prefill_nodes + decode_nodes = total). The container image (lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev) is consistent across the master YAML and all recipe files.
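The cross-check described above is mechanical enough to script. The sketch below is a hypothetical Python version: the dict shapes are assumptions modeled on the parameters the review compares (num-worker, TP, EP, dp-attn, concurrency), and a real script would load nvidia-master.yaml and the recipe files with yaml.safe_load rather than hard-coding them.

```python
# Hypothetical consistency check between a master search-space entry and a
# recipe file. Dict layouts are assumptions; real files would be parsed with
# yaml.safe_load from nvidia-master.yaml and the recipe YAML.

def check_entry(master: dict, recipe: dict) -> list:
    """Return a list of mismatch descriptions (empty list means all match)."""
    problems = []
    for side in ("prefill", "decode"):
        m, r = master[side], recipe[side]
        if m["num-worker"] != r["workers"]:
            problems.append(f"{side} num-worker mismatch")
        if m["tp"] != r["tensor-parallel-size"]:
            problems.append(f"{side} TP mismatch")
        if m["ep"] != r["expert-parallel-size"]:
            problems.append(f"{side} EP mismatch")
        # an absent enable-dp-attention key in the recipe is treated as false
        if m["dp-attn"] != r.get("enable-dp-attention", False):
            problems.append(f"{side} dp-attn mismatch")
    if master["concurrency"] != [int(c) for c in recipe["concurrencies"].split(",")]:
        problems.append("concurrency mismatch")
    return problems

# Values for disagg-gb300-8p1d-dep4-dep16-12-c4096, taken from the tables above.
master = {
    "prefill": {"num-worker": 8, "tp": 4, "ep": 4, "dp-attn": True},
    "decode": {"num-worker": 1, "tp": 16, "ep": 16, "dp-attn": True},
    "concurrency": [4096],
}
recipe = {
    "prefill": {"workers": 8, "tensor-parallel-size": 4,
                "expert-parallel-size": 4, "enable-dp-attention": True},
    "decode": {"workers": 1, "tensor-parallel-size": 16,
               "expert-parallel-size": 16, "enable-dp-attention": True},
    "concurrencies": "4096",
}
assert check_entry(master, recipe) == []  # all parameters line up
```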


Klaud-Cold (Collaborator) commented May 7, 2026

Claude finished @cquil11's task in 1m 1s (View job)


Adding PR Description

  • Gather context on all changed files
  • Draft and update PR description

PR description has been added with a full summary of all changes: search-space overhaul (WideEP TP=16 decode, scaled concurrencies), recipe file renames, srt-slurm pin switch to upstream NVIDIA/srt-slurm main, eval re-enablement, and a topology summary table.

cquil11 merged commit 876b595 into main on May 7, 2026 (58 of 64 checks passed)
cquil11 deleted the sglang-disagg-gb300-0506 branch on May 7, 2026 17:48
ch-wan added a commit that referenced this pull request May 8, 2026
Re-derived each MTP recipe by cloning the closest-topology working
base recipe from main (verified to pass CI in PR #1295) and only
adding the four `speculative-*` keys for EAGLE/MTP plus topology
adjustments (prefill_workers, decode_nodes, gpus_per_decode, decode
TP/DP/EP, concurrencies). All env vars, sbatch_directives, container,
dynamo hash, frontend args, deepep config, and other tuning come from
the working base verbatim — discarding the elvischenv structure that
used different env vars (PRECOMPILE vs FAST_WARMUP) and `mxfp4`
precision.

Recipe -> base mapping:
- disagg-low-latency-1p1d-tp4-tp4 -> disagg-gb300-1p1d-tp4-tp4-2-c1
- disagg-mid-curve-1p1d-dep4-dep16 -> disagg-gb300-1p1d-dep4-dep16-5-c1024 (+conc=256)
- disagg-mid-curve-1p1d-dep4-dep8 -> wideep base, decode TP=8, conc=256
- disagg-mid-curve-2p1d-dep4-dep8 -> wideep base, 2P, decode TP=8, conc=512
- disagg-mid-curve-4p1d-dep4-dep8 -> wideep base, 4P, decode TP=8, conc=1024
- disagg-low-latency-1p6d-dep4-tp4 -> hybrid: wideep prefill + 1p1d-tp4-tp4 decode

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5 participants