feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg) #69
Closed
YAMY1234 wants to merge 2 commits into NVIDIA:main from
added 2 commits
April 24, 2026 01:48
Adds eight SGLang recipes covering NVIDIA-verified DeepSeek-V4-Pro
(1.6T MXFP4 MoE) aggregated-serving configurations on Grace+Blackwell:
recipes/gb300-fp4/1k1k-dsv4/
  agg-low-latency.yaml      TP=4 + MTP 3/4 (min TPOT)
  agg-nomtp.yaml            TP=4 (baseline)
  agg-balanced-tep.yaml     TP=4+DP=4 DeepEP + MTP 1/2
  agg-max-tpt-tep.yaml      TP=4+DP=4 DeepEP (max TPS/GPU)
  agg-2n-low-latency.yaml   TP=8 + MTP 3/4
  agg-2n-nomtp.yaml         TP=8
recipes/gb200-fp4/1k1k-dsv4/
  agg-2n-low-latency.yaml   TP=8 + MTP 3/4
  agg-2n-nomtp.yaml         TP=8
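For reviewers checking the directory layout, the recipe list above can be written down as data and grouped by platform. This is a minimal sketch: the paths and TP sizes are transcribed from the list, while the `RECIPES` mapping and the `by_platform` helper are hypothetical names used only for illustration and are not part of this PR.

```python
from pathlib import PurePosixPath

# Recipe path -> (tensor-parallel size, summary), transcribed from the list above.
# The mapping itself is illustrative; it does not exist in the repository.
RECIPES = {
    "recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml": (4, "TP=4 + MTP 3/4"),
    "recipes/gb300-fp4/1k1k-dsv4/agg-nomtp.yaml": (4, "TP=4"),
    "recipes/gb300-fp4/1k1k-dsv4/agg-balanced-tep.yaml": (4, "TP=4+DP=4 DeepEP + MTP 1/2"),
    "recipes/gb300-fp4/1k1k-dsv4/agg-max-tpt-tep.yaml": (4, "TP=4+DP=4 DeepEP"),
    "recipes/gb300-fp4/1k1k-dsv4/agg-2n-low-latency.yaml": (8, "TP=8 + MTP 3/4"),
    "recipes/gb300-fp4/1k1k-dsv4/agg-2n-nomtp.yaml": (8, "TP=8"),
    "recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml": (8, "TP=8 + MTP 3/4"),
    "recipes/gb200-fp4/1k1k-dsv4/agg-2n-nomtp.yaml": (8, "TP=8"),
}


def by_platform(recipes):
    """Group recipe file names by platform directory (the second path component)."""
    groups = {}
    for path in recipes:
        parts = PurePosixPath(path).parts  # ('recipes', '<platform>', '1k1k-dsv4', '<file>')
        groups.setdefault(parts[1], []).append(parts[-1])
    return groups


groups = by_platform(RECIPES)
for platform, files in groups.items():
    print(platform, len(files), files)
```

Grouping on the second path component keeps the check independent of how many recipes each platform directory eventually holds.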
Flag set derived from the SGLang DSv4 cookbook
(docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx):
* moe-runner-backend: flashinfer_mxfp4 (MXFP4 MoE on Blackwell)
* chunked-prefill-size: 4096 + disable-flashinfer-autotune: true
* EAGLE spec-decoding 3/4 for low-latency, 1/2 for balanced
* TEP recipes: enable-dp-attention + moe-a2a-backend: deepep +
  deepep-config num_sms=96 (DEEPEP_LARGE_SMS_FLAG, single-node Blackwell)
* disable-radix-cache: true (synthetic bench best practice, also
  reduces allocator fragmentation during MXFP4 weight-reorder)
* mem-fraction-static: 0.78 (0.82 intermittently OOMs GB300 during
  reorder_w1w3_to_w3w1; 0.78 leaves contiguous headroom)
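To make the flag set above easier to eyeball, here is a rough sketch that collects the key/value pairs from the bullets and renders them as launch-style arguments. It assumes each recipe key maps one-to-one onto a `--key value` server argument and that a `true` value means a bare switch; that mapping, and the `to_cli_args` helper, are illustrative assumptions rather than how the recipes are actually consumed. The EAGLE and `deepep-config` settings are left out because the message does not spell out their exact key/value form.

```python
# Flags the commit message lists without a recipe qualifier.
COMMON_FLAGS = {
    "moe-runner-backend": "flashinfer_mxfp4",
    "chunked-prefill-size": 4096,
    "disable-flashinfer-autotune": True,
    "disable-radix-cache": True,
    "mem-fraction-static": 0.78,
}

# Extra flags the message lists for the TEP recipes only.
TEP_FLAGS = {
    "enable-dp-attention": True,
    "moe-a2a-backend": "deepep",
}


def to_cli_args(flags):
    """Hypothetical helper: render a flag mapping as `--key value` tokens.

    Assumes True means a bare switch and False means the flag is omitted.
    """
    args = []
    for key, value in flags.items():
        if value is True:
            args.append(f"--{key}")
        elif value is False:
            continue
        else:
            args.extend([f"--{key}", str(value)])
    return args


# Merge the common flags with the TEP-only flags, as a TEP recipe would.
tep_args = to_cli_args({**COMMON_FLAGS, **TEP_FLAGS})
print(" ".join(tep_args))
```

Keeping the TEP-only flags in a separate mapping mirrors the bullet structure: the non-TEP recipes would use `COMMON_FLAGS` alone.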
srtslurm.yaml.example: added deepseek-v4 model + container aliases.
Also adds README.md in each recipe subdir with rationale + reference
pointers to the SGLang cookbook and DSv4-Pro model card.
Made-with: Cursor
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff           @@
##             main      #69   +/-   ##
=======================================
  Coverage        ?   70.35%
=======================================
  Files           ?       59
  Lines           ?     6270
  Branches        ?        0
=======================================
  Hits            ?     4411
  Misses          ?     1859
  Partials        ?        0
Summary
Adds eight SGLang aggregated-serving recipes for DeepSeek-V4-Pro (1.6T MXFP4 MoE) on Grace+Blackwell hardware, plus model / container aliases in `srtslurm.yaml.example`.
Draft PR: opening early for review of the flag choices and directory layout before I submit final benchmark numbers.
New recipes
- `recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml`
- `recipes/gb300-fp4/1k1k-dsv4/agg-nomtp.yaml`
- `recipes/gb300-fp4/1k1k-dsv4/agg-balanced-tep.yaml`
- `recipes/gb300-fp4/1k1k-dsv4/agg-max-tpt-tep.yaml`
- `recipes/gb300-fp4/1k1k-dsv4/agg-2n-low-latency.yaml`
- `recipes/gb300-fp4/1k1k-dsv4/agg-2n-nomtp.yaml`
- `recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml`
- `recipes/gb200-fp4/1k1k-dsv4/agg-2n-nomtp.yaml`
Source of truth for flags
All configurations are derived from the official SGLang DeepSeek-V4 cookbook:
- `docs_new/src/snippets/autoregressive/deepseek-v4-deployment.jsx` (low-latency / balanced / max-throughput recipes)
- `docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx`
Key cookbook-derived flags:
Non-cookbook adjustments (documented in README)
srtslurm.yaml.example changes
Added stock aliases for the three DSv4 cookbook container tags
(`dsv4-blackwell`, `dsv4-grace-blackwell`, `dsv4-hopper`) and the two
DSv4 checkpoints (`dsv4-pro`, `dsv4-flash`). These are examples; each
cluster deployment fills in its own paths.
Test plan
Will flip to ready-for-review once I have completed sweep data to attach.
Made with Cursor