
Add DSV4 Pro GB300 high-throughput recipe #95

Merged

ishandhanani merged 3 commits into NVIDIA:recipes/dsv4-agg-disagg from weireweire:recipes/dsv4-agg-disagg on Apr 27, 2026

Conversation

@weireweire
Collaborator

No description provided.

@weireweire force-pushed the recipes/dsv4-agg-disagg branch from 0ea486a to 1153bea on April 27, 2026 at 12:58
Collaborator

@qiching left a comment


LGTM

@ishandhanani merged commit 8dca187 into NVIDIA:recipes/dsv4-agg-disagg on Apr 27, 2026
1 check passed
ishandhanani added a commit that referenced this pull request Apr 28, 2026
* feat: DeepSeek-V4-Pro disaggregated 1P1D recipe on GB300 (1k/1k)

Adds the dynamo + NIXL disaggregated counterpart to the existing
`gb300-fp4/1k1k-dsv4/agg-*` recipes: 1 prefill node + 1 decode node, both
TP=4 on a single GB300, MXFP4 MoE kernels, chunked-prefill 4096. Same
DSv4-Pro checkpoint and `dsv4-grace-blackwell` container as the agg
recipes; nginx fan-in container is pulled from Docker Hub via enroot.
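
For orientation, the shape of such a recipe might look roughly like the sketch below. Only the container name, TP degree, MXFP4 MoE, chunked-prefill size, and the nginx/enroot fan-in come from the description above; the key names are assumptions about the recipe schema, not copied from the actual YAML.

```yaml
# Illustrative sketch only -- field names are guesses; values are from
# the commit message (1 prefill node + 1 decode node, TP=4, MXFP4,
# chunked-prefill 4096, nginx fan-in pulled via enroot).
container: dsv4-grace-blackwell
topology:
  prefill: { nodes: 1, tensor_parallel: 4 }
  decode:  { nodes: 1, tensor_parallel: 4 }
engine:
  moe_quantization: mxfp4
  chunked_prefill_size: 4096
frontend:
  transfer: nixl          # dynamo + NIXL disaggregation
  fan_in: nginx           # image pulled from Docker Hub via enroot
```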

`benchmark.type` is `manual`, so the recipe brings the disagg server up
and stops there; pair it with sa-bench (custom_tokenizer
`sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer` plus the
chat template) once the server is healthy.
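
A hedged sketch of that pairing; only `benchmark.type: manual` and the tokenizer path are stated above, while the sa-bench flag names are assumptions about its CLI:

```yaml
benchmark:
  type: manual   # recipe brings the disagg server up, then stops
# Once the server is healthy, drive it by hand, e.g.:
#   sa-bench \
#     --custom-tokenizer sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer \
#     --chat-template <deepseek-v4 chat template>
```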

README updated with a `Disaggregated` table to keep the existing agg
matrix intact.

Made-with: Cursor

* feat: add 4 multi-node DEP / TP disagg recipes on GB300 (1k/1k)

Builds on the existing 1P1D TP=4 disagg recipe by adding four more
points along the disagg topology curve, all sharing the same dynamo +
NIXL frontend and the `dsv4-grace-blackwell` container:

- disagg-1p1d-dep4-mega-moe.yaml         (2 nodes,  8 GPU; both DEP=4)
- disagg-1p2d-dep4-to-dep8-mega-moe.yaml (3 nodes, 12 GPU; P DEP=4, D DEP=8)
- disagg-2p2d-dep8-mega-moe.yaml         (4 nodes, 16 GPU; both DEP=8)
- disagg-2p2d-tp8-mxfp4.yaml             (4 nodes, 16 GPU; both TP=8, MXFP4)

DEP recipes use TP+DP+DP-attention+DeepEP (mega_moe / DeepGEMM),
mirroring the agg-balanced-tep / agg-max-tpt-tep topology but split
across prefill and decode roles. Multi-node decode recipes intentionally
do NOT set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 because CAR_V2 is
single-node only and silently corrupts results across nodes.
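
As a sketch of how that rule lands in the recipes (the env variable name and the single-node-only constraint are from the message above; the surrounding YAML keys are assumptions):

```yaml
# Single-node side: CAR_V2 is safe and enabled.
prefill:
  env:
    SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"
# Multi-node decode side (e.g. DEP=8 across nodes): CAR_V2 is
# deliberately omitted -- it is single-node only and silently
# corrupts results across nodes.
decode:
  env: {}
```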

Also tightens the existing disagg-1p1d-tp4-mxfp4.yaml: switches from
`benchmark.type: manual` to a low-latency sa-bench sweep (conc 4..128)
and adds the same mrr / cgmb / mfs knobs as the new recipes for
reproducibility.
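
The tightened benchmark section would then look something like the sketch below; the conc 4..128 range is from the message, while the key names, the placeholder values, and the reading of mrr / cgmb / mfs as sglang's max-running-requests / cuda-graph-max-bs / mem-fraction-static are assumptions:

```yaml
benchmark:
  type: sa-bench                       # replaces `manual`
  concurrency: [4, 8, 16, 32, 64, 128]
engine_args:
  max_running_requests: 128            # "mrr" -- assumed expansion, placeholder value
  cuda_graph_max_bs: 128               # "cgmb" -- assumed expansion, placeholder value
  mem_fraction_static: 0.85            # "mfs" -- assumed expansion, placeholder value
```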

README gains:
- a prominent NIXL state-buffer-fix prerequisite warning (upstream
  sglang PR pending) so reviewers know what container behaviour the
  recipes assume,
- an XPYD = nodes (not instances) clarification,
- a verified-throughput table from sa-bench runs at isl=osl=1024.

Headline: the asymmetric 1P2D DEP4->DEP8 config delivers the highest
per-GPU total token throughput (5,572 TPS/GPU at conc=2048) because at
1k/1k the workload is decode-heavy, so doubling the decode EP domain
(4 -> 8 GPUs) buys far more than scaling prefill.
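
For absolute scale (using the 12-GPU count from the topology list above,
and assuming the per-GPU figure is normalized over all 12 GPUs):
5,572 TPS/GPU x 12 GPUs ~= 66,864 total token TPS for the 1P2D config
at conc=2048.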

Recipes are intentionally kept clean of local mounts / debug paths; pick
up the required nixl/conn.py state-buffer-transfer fix via the container
build process until the upstream sglang fix lands.

Made-with: Cursor

* nixl

* reshuffle like them mf experts

* gb200

* aime

* aime

* fix: add missing mega_moe SGLANG_OPT_* env to dsv4 disagg recipes (#94)

SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=0 only works when
SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 is also set. Without it, the DeepEP
buffer is too small for cuda-graph-max-bs=1024/2048 and graph capture
hits the deep_ep.cpp:1233 assertion.

Add the full mega_moe env block to all three *-mega-moe.yaml recipes,
and set SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2=1 only on single-node sides.
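
Concretely, the coupled block reads roughly as below; the variable names,
values, and the single-node CAR_V2 rule are from this message, while the
`env:` placement is an assumption about the recipe schema:

```yaml
env:
  SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1"
  # =0 is only valid together with MEGA_MOE=1; otherwise the DeepEP
  # buffer is undersized for cuda-graph-max-bs=1024/2048 and graph
  # capture trips the deep_ep.cpp:1233 assertion.
  SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0"
  # Only on single-node prefill/decode sides:
  SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"
```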

* Add DSV4 Pro GB300 high-throughput recipe (#95)

* Add DSV4 Pro GB300 high-throughput recipe

* fix wideep oom

* optimize perf

* moved

* go

* cleanups

* go

---------

Co-authored-by: Yangmin Li <yangminl@nvidia.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: weireweire <weiliangl@nvidia.com>