feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg) #69

Closed
YAMY1234 wants to merge 2 commits into NVIDIA:main from YAMY1234:dsv4-pro-recipes

Conversation

@YAMY1234
Collaborator

Summary

Adds eight SGLang aggregated-serving recipes for DeepSeek-V4-Pro (1.6T MXFP4 MoE) on Grace+Blackwell hardware, plus model / container aliases in srtslurm.yaml.example.

Draft PR — opening early for review of the flag choices and directory layout before I submit final benchmark numbers.

New recipes

| file | hardware | parallelism | MTP |
| --- | --- | --- | --- |
| `recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml` | GB300, 1n | TP=4 | EAGLE 3/4 |
| `recipes/gb300-fp4/1k1k-dsv4/agg-nomtp.yaml` | GB300, 1n | TP=4 | none |
| `recipes/gb300-fp4/1k1k-dsv4/agg-balanced-tep.yaml` | GB300, 1n | TP=4 + DP=4 + DP-attn + DeepEP | EAGLE 1/2 |
| `recipes/gb300-fp4/1k1k-dsv4/agg-max-tpt-tep.yaml` | GB300, 1n | TP=4 + DP=4 + DP-attn + DeepEP | none |
| `recipes/gb300-fp4/1k1k-dsv4/agg-2n-low-latency.yaml` | GB300, 2n | TP=8 | EAGLE 3/4 |
| `recipes/gb300-fp4/1k1k-dsv4/agg-2n-nomtp.yaml` | GB300, 2n | TP=8 | none |
| `recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml` | GB200, 2n | TP=8 | EAGLE 3/4 |
| `recipes/gb200-fp4/1k1k-dsv4/agg-2n-nomtp.yaml` | GB200, 2n | TP=8 | none |
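
For orientation, here is a purely hypothetical sketch of what the first recipe file could contain. The top-level keys (`model`, `container`, `nodes`, `server_args`) are guesses at the srtctl schema rather than anything from this PR; only the `server_args` values are flags actually named here, with the EAGLE key names expanded from the `steps/topk/draft-tokens` shorthand used below.

```yaml
# Hypothetical shape of recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml.
# Top-level keys are guessed for illustration; the actual srtctl schema
# may differ. Values are the flags named in this PR.
model: dsv4-pro                  # checkpoint alias (srtslurm.yaml.example)
container: dsv4-grace-blackwell  # container alias (srtslurm.yaml.example)
nodes: 1
server_args:
  tp-size: 4
  moe-runner-backend: flashinfer_mxfp4   # MXFP4 MoE kernels, Blackwell only
  chunked-prefill-size: 4096
  disable-flashinfer-autotune: true
  speculative-algorithm: EAGLE           # the "EAGLE 3/4" column above
  speculative-num-steps: 3
  speculative-eagle-topk: 1
  speculative-num-draft-tokens: 4
  disable-radix-cache: true              # non-cookbook override, see below
  mem-fraction-static: 0.78              # non-cookbook override, see below
```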

Source of truth for flags

All configurations are derived from the official SGLang DeepSeek-V4 cookbook (`docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx`).

Key cookbook-derived flags (a combined launch sketch for the balanced TEP variant follows this list):

  • `moe-runner-backend: flashinfer_mxfp4` — MXFP4 MoE kernels (Blackwell only)
  • `chunked-prefill-size: 4096` + `disable-flashinfer-autotune: true`
  • `--speculative-num-steps/topk/draft-tokens 3 1 4` for low-latency, `1 1 2` for balanced
  • TEP recipes: `--enable-dp-attention --moe-a2a-backend deepep` plus `--deepep-config '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}'` (`DEEPEP_LARGE_SMS_FLAG` for single-node Blackwell per cookbook)
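
Assembled into one command, the balanced TEP flag set would look roughly like the sketch below. This is illustrative only: `--model-path` is a placeholder, `--dp-size 4` and `--speculative-algorithm EAGLE` are my inferences from the recipe table, and the three EAGLE flag names expand the `steps/topk/draft-tokens` shorthand above.

```bash
# Illustrative single-node launch for agg-balanced-tep.yaml (GB300).
# --model-path is a placeholder; --dp-size and --speculative-algorithm
# are inferred, not copied from the PR.
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V4-Pro \
  --tp-size 4 --dp-size 4 --enable-dp-attention \
  --moe-a2a-backend deepep \
  --deepep-config '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' \
  --moe-runner-backend flashinfer_mxfp4 \
  --chunked-prefill-size 4096 --disable-flashinfer-autotune \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
  --disable-radix-cache --mem-fraction-static 0.78
```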

Non-cookbook adjustments (documented in README)

  • `disable-radix-cache: true` — sa-bench uses synthetic random prompts, so radix prefix hits are accidental and non-representative. On GB300 the default `RadixCache` path also interacts badly with the MXFP4 `reorder_w1w3_to_w3w1` contiguous-allocation step, causing a 20 MiB CUDA OOM despite ~4 GB free (fragmentation); the `ChunkCache` path is more conservative here.
  • `mem-fraction-static: 0.78` — the cookbook uses 0.82 for big MoE models, but on single-node GB300 we observed `reorder_w1w3_to_w3w1` OOMing intermittently at 0.82. 0.78 leaves ~4% contiguous headroom and is what the cookbook's `cp` recipe uses elsewhere.

srtslurm.yaml.example changes

Added stock aliases for the three DSv4 cookbook container tags
(`dsv4-blackwell`, `dsv4-grace-blackwell`, `dsv4-hopper`) and the two
DSv4 checkpoints (`dsv4-pro`, `dsv4-flash`). These are examples — each
cluster deployment fills in its own paths.
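
A hypothetical sketch of the added entries (the alias names are from this PR; the mapping structure, registry tags, and paths are placeholders):

```yaml
# Illustrative srtslurm.yaml.example additions; structure and values
# are placeholders each deployment replaces.
containers:
  dsv4-blackwell:       "<registry>/sglang:dsv4-blackwell"
  dsv4-grace-blackwell: "<registry>/sglang:dsv4-grace-blackwell"
  dsv4-hopper:          "<registry>/sglang:dsv4-hopper"
models:
  dsv4-pro:   "/path/to/DeepSeek-V4-Pro"
  dsv4-flash: "/path/to/DeepSeek-V4-Flash"
```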

Test plan

  • All 8 recipes pass `srtctl apply -f` dry-run with schema validation
  • Single-node GB300 recipes start healthily and complete a c=1→1024 sweep
  • 2-node GB300 / GB200 recipes start healthily and complete a c=1→1024 sweep
  • Pareto curves look sensible (monotone along the canonical TPS/User vs TPS/GPU frontier)
  • TEP recipes don't regress vs pure-TP baseline at the concurrencies the cookbook targets
  • Secondary pass: confirm no sensitive cluster paths / tokens / S3 creds leaked into committed YAML; I've grepped for `lustre/fsw`, `ghp_`, `s3://`, `secret`, `password`, `bearer` (exact command below)
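
The leak check in the last bullet is a recursive grep over the committed files, e.g.:

```bash
# Scan committed YAML for cluster paths and credential-shaped strings;
# the patterns are the ones listed in the test plan above.
grep -rniE 'lustre/fsw|ghp_|s3://|secret|password|bearer' recipes/ srtslurm.yaml.example
```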

Will flip to ready-for-review once I have complete sweep data to attach.

Made with Cursor

YAMY1234 added 2 commits April 24, 2026 01:48
Adds eight SGLang recipes covering NVIDIA-verified DeepSeek-V4-Pro
(1.6T MXFP4 MoE) aggregated-serving configurations on Grace+Blackwell:

  recipes/gb300-fp4/1k1k-dsv4/
    agg-low-latency.yaml       — TP=4 + MTP 3/4   (min TPOT)
    agg-nomtp.yaml             — TP=4             (baseline)
    agg-balanced-tep.yaml      — TP=4+DP=4 DeepEP + MTP 1/2
    agg-max-tpt-tep.yaml       — TP=4+DP=4 DeepEP (max TPS/GPU)
    agg-2n-low-latency.yaml    — TP=8 + MTP 3/4
    agg-2n-nomtp.yaml          — TP=8

  recipes/gb200-fp4/1k1k-dsv4/
    agg-2n-low-latency.yaml    — TP=8 + MTP 3/4
    agg-2n-nomtp.yaml          — TP=8

Flag set derived from the SGLang DSv4 cookbook
(docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx):

  * moe-runner-backend: flashinfer_mxfp4      (MXFP4 MoE on Blackwell)
  * chunked-prefill-size: 4096 + disable-flashinfer-autotune: true
  * EAGLE spec-decoding 3/4 for low-latency, 1/2 for balanced
  * TEP recipes: enable-dp-attention + moe-a2a-backend: deepep +
    deepep-config num_sms=96 (DEEPEP_LARGE_SMS_FLAG, single-node Blackwell)
  * disable-radix-cache: true (synthetic bench best practice, also
    reduces allocator fragmentation during MXFP4 weight-reorder)
  * mem-fraction-static: 0.78 (0.82 intermittently OOMs GB300 during
    reorder_w1w3_to_w3w1; 0.78 leaves contiguous headroom)

srtslurm.yaml.example: added deepseek-v4 model + container aliases.

Also adds README.md in each recipe subdir with rationale + reference
pointers to the SGLang cookbook and DSv4-Pro model card.

Made-with: Cursor
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@54badf2).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #69   +/-   ##
=======================================
  Coverage        ?   70.35%           
=======================================
  Files           ?       59           
  Lines           ?     6270           
  Branches        ?        0           
=======================================
  Hits            ?     4411           
  Misses          ?     1859           
  Partials        ?        0           


@ishandhanani
Collaborator

ishandhanani commented Apr 26, 2026

Closing this as superseded. The aggregate DSv4 recipe work from this line landed via #70, and the upstream-branch follow-up for the remaining disaggregated recipe additions is now #85 (NVIDIA/srt-slurm:recipes/dsv4-agg-disagg).
