feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg) #69

Closed
YAMY1234 wants to merge 2 commits into NVIDIA:main from YAMY1234:dsv4-pro-recipes

Conversation

@YAMY1234
Collaborator

Summary

Adds eight SGLang aggregated-serving recipes for DeepSeek-V4-Pro (1.6T MXFP4 MoE) on Grace+Blackwell hardware, plus model / container aliases in srtslurm.yaml.example.

Draft PR — opening early for review of the flag choices and directory layout before I submit final benchmark numbers.

New recipes

| file | hardware | parallelism | MTP |
| --- | --- | --- | --- |
| `recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml` | GB300, 1n | TP=4 | EAGLE 3/4 |
| `recipes/gb300-fp4/1k1k-dsv4/agg-nomtp.yaml` | GB300, 1n | TP=4 | none |
| `recipes/gb300-fp4/1k1k-dsv4/agg-balanced-tep.yaml` | GB300, 1n | TP=4 + DP=4 + DP-attn + DeepEP | EAGLE 1/2 |
| `recipes/gb300-fp4/1k1k-dsv4/agg-max-tpt-tep.yaml` | GB300, 1n | TP=4 + DP=4 + DP-attn + DeepEP | none |
| `recipes/gb300-fp4/1k1k-dsv4/agg-2n-low-latency.yaml` | GB300, 2n | TP=8 | EAGLE 3/4 |
| `recipes/gb300-fp4/1k1k-dsv4/agg-2n-nomtp.yaml` | GB300, 2n | TP=8 | none |
| `recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml` | GB200, 2n | TP=8 | EAGLE 3/4 |
| `recipes/gb200-fp4/1k1k-dsv4/agg-2n-nomtp.yaml` | GB200, 2n | TP=8 | none |
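
For orientation, here is a purely hypothetical sketch of what the first recipe file could contain. The top-level keys (`model`, `container`, `nodes`, `server_args`) are guesses at the srtctl schema rather than anything from this PR; only the `server_args` values are flags actually named here, with the EAGLE key names expanded from the `steps/topk/draft-tokens` shorthand used below.

```yaml
# Hypothetical shape of recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml.
# Top-level keys are guessed for illustration; the actual srtctl schema
# may differ. Values are the flags named in this PR.
model: dsv4-pro                  # checkpoint alias (srtslurm.yaml.example)
container: dsv4-grace-blackwell  # container alias (srtslurm.yaml.example)
nodes: 1
server_args:
  tp-size: 4
  moe-runner-backend: flashinfer_mxfp4   # MXFP4 MoE kernels, Blackwell only
  chunked-prefill-size: 4096
  disable-flashinfer-autotune: true
  speculative-algorithm: EAGLE           # the "EAGLE 3/4" column above
  speculative-num-steps: 3
  speculative-eagle-topk: 1
  speculative-num-draft-tokens: 4
  disable-radix-cache: true              # non-cookbook override, see below
  mem-fraction-static: 0.78              # non-cookbook override, see below
```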

Source of truth for flags

All configurations are derived from the official SGLang DeepSeek-V4 cookbook (`docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx`).

Key cookbook-derived flags (a combined launch sketch for the balanced TEP variant follows this list):

  • `moe-runner-backend: flashinfer_mxfp4` — MXFP4 MoE kernels (Blackwell only)
  • `chunked-prefill-size: 4096` + `disable-flashinfer-autotune: true`
  • `--speculative-num-steps/topk/draft-tokens 3 1 4` for low-latency, `1 1 2` for balanced
  • TEP recipes: `--enable-dp-attention --moe-a2a-backend deepep` plus `--deepep-config '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}'` (`DEEPEP_LARGE_SMS_FLAG` for single-node Blackwell per cookbook)
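
Assembled into one command, the balanced TEP flag set would look roughly like the sketch below. This is illustrative only: `--model-path` is a placeholder, `--dp-size 4` and `--speculative-algorithm EAGLE` are my inferences from the recipe table, and the three EAGLE flag names expand the `steps/topk/draft-tokens` shorthand above.

```bash
# Illustrative single-node launch for agg-balanced-tep.yaml (GB300).
# --model-path is a placeholder; --dp-size and --speculative-algorithm
# are inferred, not copied from the PR.
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V4-Pro \
  --tp-size 4 --dp-size 4 --enable-dp-attention \
  --moe-a2a-backend deepep \
  --deepep-config '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' \
  --moe-runner-backend flashinfer_mxfp4 \
  --chunked-prefill-size 4096 --disable-flashinfer-autotune \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
  --disable-radix-cache --mem-fraction-static 0.78
```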

Non-cookbook adjustments (documented in README)

  • `disable-radix-cache: true` — sa-bench uses synthetic random prompts, so radix prefix hits are accidental and non-representative. On GB300 the default `RadixCache` path also interacts badly with the MXFP4 `reorder_w1w3_to_w3w1` contiguous-allocation step, causing a 20 MiB CUDA OOM despite ~4 GB free (fragmentation); the `ChunkCache` path is more conservative here.
  • `mem-fraction-static: 0.78` — the cookbook uses 0.82 for big MoE models, but on single-node GB300 we observed `reorder_w1w3_to_w3w1` OOMing intermittently at 0.82. 0.78 leaves ~4% contiguous headroom and is what the cookbook's `cp` recipe uses elsewhere.

srtslurm.yaml.example changes

Added stock aliases for the three DSv4 cookbook container tags
(`dsv4-blackwell`, `dsv4-grace-blackwell`, `dsv4-hopper`) and the two
DSv4 checkpoints (`dsv4-pro`, `dsv4-flash`). These are examples — each
cluster deployment fills in its own paths.
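
A hypothetical sketch of the added entries (the alias names are from this PR; the mapping structure, registry tags, and paths are placeholders):

```yaml
# Illustrative srtslurm.yaml.example additions; structure and values
# are placeholders each deployment replaces.
containers:
  dsv4-blackwell:       "<registry>/sglang:dsv4-blackwell"
  dsv4-grace-blackwell: "<registry>/sglang:dsv4-grace-blackwell"
  dsv4-hopper:          "<registry>/sglang:dsv4-hopper"
models:
  dsv4-pro:   "/path/to/DeepSeek-V4-Pro"
  dsv4-flash: "/path/to/DeepSeek-V4-Flash"
```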

Test plan

  • All 8 recipes pass `srtctl apply -f` dry-run with schema validation
  • Single-node GB300 recipes start healthily and complete a c=1→1024 sweep
  • 2-node GB300 / GB200 recipes start healthily and complete a c=1→1024 sweep
  • Pareto curves look sensible (monotone along the canonical TPS/User vs TPS/GPU frontier)
  • TEP recipes don't regress vs pure-TP baseline at the concurrencies the cookbook targets
  • Secondary pass: confirm no sensitive cluster paths / tokens / S3 creds leaked into committed YAML; I've grepped for `lustre/fsw`, `ghp_`, `s3://`, `secret`, `password`, `bearer` (exact command below)
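
The leak check in the last bullet is a recursive grep over the committed files, e.g.:

```bash
# Scan committed YAML for cluster paths and credential-shaped strings;
# the patterns are the ones listed in the test plan above.
grep -rniE 'lustre/fsw|ghp_|s3://|secret|password|bearer' recipes/ srtslurm.yaml.example
```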

Will flip to ready-for-review once I have complete sweep data to attach.

Made with Cursor

YAMY1234 added 2 commits April 24, 2026 01:48
Adds eight SGLang recipes covering NVIDIA-verified DeepSeek-V4-Pro
(1.6T MXFP4 MoE) aggregated-serving configurations on Grace+Blackwell:

  recipes/gb300-fp4/1k1k-dsv4/
    agg-low-latency.yaml       — TP=4 + MTP 3/4   (min TPOT)
    agg-nomtp.yaml             — TP=4             (baseline)
    agg-balanced-tep.yaml      — TP=4+DP=4 DeepEP + MTP 1/2
    agg-max-tpt-tep.yaml       — TP=4+DP=4 DeepEP (max TPS/GPU)
    agg-2n-low-latency.yaml    — TP=8 + MTP 3/4
    agg-2n-nomtp.yaml          — TP=8

  recipes/gb200-fp4/1k1k-dsv4/
    agg-2n-low-latency.yaml    — TP=8 + MTP 3/4
    agg-2n-nomtp.yaml          — TP=8

Flag set derived from the SGLang DSv4 cookbook
(docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx):

  * moe-runner-backend: flashinfer_mxfp4      (MXFP4 MoE on Blackwell)
  * chunked-prefill-size: 4096 + disable-flashinfer-autotune: true
  * EAGLE spec-decoding 3/4 for low-latency, 1/2 for balanced
  * TEP recipes: enable-dp-attention + moe-a2a-backend: deepep +
    deepep-config num_sms=96 (DEEPEP_LARGE_SMS_FLAG, single-node Blackwell)
  * disable-radix-cache: true (synthetic bench best practice, also
    reduces allocator fragmentation during MXFP4 weight-reorder)
  * mem-fraction-static: 0.78 (0.82 intermittently OOMs GB300 during
    reorder_w1w3_to_w3w1; 0.78 leaves contiguous headroom)

srtslurm.yaml.example: added deepseek-v4 model + container aliases.

Also adds README.md in each recipe subdir with rationale + reference
pointers to the SGLang cookbook and DSv4-Pro model card.

Made-with: Cursor
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@54badf2).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #69   +/-   ##
=======================================
  Coverage        ?   70.35%           
=======================================
  Files           ?       59           
  Lines           ?     6270           
  Branches        ?        0           
=======================================
  Hits            ?     4411           
  Misses          ?     1859           
  Partials        ?        0           


@ishandhanani
Collaborator

ishandhanani commented Apr 26, 2026

Closing this as superseded. The aggregate DSv4 recipe work from this line landed via #70, and the upstream-branch follow-up for the remaining disaggregated recipe additions is now #85 (NVIDIA/srt-slurm:recipes/dsv4-agg-disagg).
