
Add DSV4 Pro GB300 high-throughput recipe #95

Merged

ishandhanani merged 3 commits into NVIDIA:recipes/dsv4-agg-disagg from weireweire:recipes/dsv4-agg-disagg on Apr 27, 2026

Conversation

@weireweire
Collaborator

No description provided.

@weireweire force-pushed the recipes/dsv4-agg-disagg branch from 0ea486a to 1153bea on April 27, 2026 at 12:58
Collaborator

@qiching left a comment


LGTM

@ishandhanani merged commit 8dca187 into NVIDIA:recipes/dsv4-agg-disagg on Apr 27, 2026
1 check passed
ishandhanani added a commit that referenced this pull request Apr 28, 2026
* feat: DeepSeek-V4-Pro disaggregated 1P1D recipe on GB300 (1k/1k)

Adds the dynamo + NIXL disaggregated counterpart to the existing
`gb300-fp4/1k1k-dsv4/agg-*` recipes: 1 prefill node + 1 decode node, both
TP=4 on a single GB300, MXFP4 MoE kernels, chunked-prefill 4096. Same
DSv4-Pro checkpoint and `dsv4-grace-blackwell` container as the agg
recipes; nginx fan-in container is pulled from Docker Hub via enroot.
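
For orientation, the shape of such a recipe might look roughly like the sketch below. Only the container name, TP degree, MXFP4 MoE, chunked-prefill size, and the nginx/enroot fan-in come from the description above; the key names are assumptions about the recipe schema, not copied from the actual YAML.

```yaml
# Illustrative sketch only -- field names are guesses; values are from
# the commit message (1 prefill node + 1 decode node, TP=4, MXFP4,
# chunked-prefill 4096, nginx fan-in pulled via enroot).
container: dsv4-grace-blackwell
topology:
  prefill: { nodes: 1, tensor_parallel: 4 }
  decode:  { nodes: 1, tensor_parallel: 4 }
engine:
  moe_quantization: mxfp4
  chunked_prefill_size: 4096
frontend:
  transfer: nixl          # dynamo + NIXL disaggregation
  fan_in: nginx           # image pulled from Docker Hub via enroot
```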

`benchmark.type` is `manual`, so the recipe brings the disagg server up
and stops there; pair it with sa-bench (custom_tokenizer
`sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer` plus the
chat template) once the server is healthy.
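
A hedged sketch of that pairing; only `benchmark.type: manual` and the tokenizer path are stated above, while the sa-bench flag names are assumptions about its CLI:

```yaml
benchmark:
  type: manual   # recipe brings the disagg server up, then stops
# Once the server is healthy, drive it by hand, e.g.:
#   sa-bench \
#     --custom-tokenizer sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer \
#     --chat-template <deepseek-v4 chat template>
```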

README updated with a `Disaggregated` table to keep the existing agg
matrix intact.

Made-with: Cursor

* feat: add 4 multi-node DEP / TP disagg recipes on GB300 (1k/1k)

Builds on the existing 1P1D TP=4 disagg recipe by adding four more
points along the disagg topology curve, all sharing the same dynamo +
NIXL frontend and the `dsv4-grace-blackwell` container:

- disagg-1p1d-dep4-mega-moe.yaml         (2 nodes,  8 GPU; both DEP=4)
- disagg-1p2d-dep4-to-dep8-mega-moe.yaml (3 nodes, 12 GPU; P DEP=4, D DEP=8)
- disagg-2p2d-dep8-mega-moe.yaml         (4 nodes, 16 GPU; both DEP=8)
- disagg-2p2d-tp8-mxfp4.yaml             (4 nodes, 16 GPU; both TP=8, MXFP4)

DEP recipes use TP+DP+DP-attention+DeepEP (mega_moe / DeepGEMM),
mirroring the agg-balanced-tep / agg-max-tpt-tep topology but split
across prefill and decode roles. Multi-node decode recipes intentionally
do NOT set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 because CAR_V2 is
single-node only and silently corrupts results across nodes.
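
As a sketch of how that rule lands in the recipes (the env variable name and the single-node-only constraint are from the message above; the surrounding YAML keys are assumptions):

```yaml
# Single-node side: CAR_V2 is safe and enabled.
prefill:
  env:
    SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"
# Multi-node decode side (e.g. DEP=8 across nodes): CAR_V2 is
# deliberately omitted -- it is single-node only and silently
# corrupts results across nodes.
decode:
  env: {}
```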

Also tightens the existing disagg-1p1d-tp4-mxfp4.yaml: switches from
`benchmark.type: manual` to a low-latency sa-bench sweep (conc 4..128)
and adds the same mrr / cgmb / mfs knobs as the new recipes for
reproducibility.
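
The tightened benchmark section would then look something like the sketch below; the conc 4..128 range is from the message, while the key names, the placeholder values, and the reading of mrr / cgmb / mfs as sglang's max-running-requests / cuda-graph-max-bs / mem-fraction-static are assumptions:

```yaml
benchmark:
  type: sa-bench                       # replaces `manual`
  concurrency: [4, 8, 16, 32, 64, 128]
engine_args:
  max_running_requests: 128            # "mrr" -- assumed expansion, placeholder value
  cuda_graph_max_bs: 128               # "cgmb" -- assumed expansion, placeholder value
  mem_fraction_static: 0.85            # "mfs" -- assumed expansion, placeholder value
```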

README gains:
- a prominent NIXL state-buffer-fix prerequisite warning (upstream
  sglang PR pending) so reviewers know what container behaviour the
  recipes assume,
- an XPYD = nodes (not instances) clarification,
- a verified-throughput table from sa-bench runs at isl=osl=1024.

Headline: the asymmetric 1P2D DEP4->DEP8 config delivers the highest
per-GPU total token throughput (5,572 TPS/GPU at conc=2048) because at
1k/1k the workload is decode-heavy, so doubling the decode EP domain
(4 -> 8 GPUs) buys far more than scaling prefill.
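
For absolute scale (using the 12-GPU count from the topology list above,
and assuming the per-GPU figure is normalized over all 12 GPUs):
5,572 TPS/GPU x 12 GPUs ~= 66,864 total token TPS for the 1P2D config
at conc=2048.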

Recipes are intentionally kept clean of local mounts / debug paths; pick
up the required nixl/conn.py state-buffer-transfer fix via the container
build process until the upstream sglang fix lands.

Made-with: Cursor

* nixl

* reshuffle like them mf experts

* gb200

* aime

* aime

* fix: add missing mega_moe SGLANG_OPT_* env to dsv4 disagg recipes (#94)

SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0 only works when
SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 is also set. Without it, the DeepEP
buffer is too small for cuda-graph-max-bs=1024/2048 and graph capture
hits the deep_ep.cpp:1233 assertion.

Add the full mega_moe env block to all three *-mega-moe.yaml recipes,
and set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 only on single-node sides.
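
Concretely, the coupled block reads roughly as below; the variable names,
values, and the single-node CAR_V2 rule are from this message, while the
`env:` placement is an assumption about the recipe schema:

```yaml
env:
  SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1"
  # =0 is only valid together with MEGA_MOE=1; otherwise the DeepEP
  # buffer is undersized for cuda-graph-max-bs=1024/2048 and graph
  # capture trips the deep_ep.cpp:1233 assertion.
  SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0"
  # Only on single-node prefill/decode sides:
  SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"
```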

* Add DSV4 Pro GB300 high-throughput recipe (#95)

* Add DSV4 Pro GB300 high-throughput recipe

* fix wideep oom

* optimize perf

* moved

* go

* cleanups

* go

---------

Co-authored-by: Yangmin Li <yangminl@nvidia.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: weireweire <weiliangl@nvidia.com>