Consolidate B200 recipes: merge 40 per-variant STP/MTP files into 4 combined files #206

Merged
weireweire merged 7 commits into main from simplify/b200-recipes
Mar 4, 2026

@weireweire
Collaborator

@weireweire weireweire commented Mar 4, 2026

Summary

  • Merged 40 individual per-variant YAML files (under b200-fp8/1k1k/stp|mtp/, b200-fp8/8k1k/stp|mtp/, b200-fp4/1k1k/stp|mtp/, b200-fp4/8k1k/stp|mtp/) into 4 combined recipe files, one per precision × ISL combination
  • Each combined file uses a shared base plus zip_override_* / override_* blocks to express STP and MTP variants with minimal duplication
  • Added inline section comments (# Model configuration, # Disaggregation mode, # Memory and token limits, # Parallelism, # Attention, # MoE, # Other flags) matching the originals
  • All 40 variants verified equivalent to originals via diff script before deletion
  • Deleted the old per-variant subdirectories
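
To illustrate the base-plus-override idea, here is a minimal sketch of how a `zip_override_*` block could expand into concrete variants: the i-th element of every list-valued leaf is selected and deep-merged over the base. This is not srtctl's actual implementation. The helper names (`deep_merge`, `pick`, `zip_length`, `expand_zip_override`) are hypothetical, and treating every list leaf as a per-variant list is an assumption.

```python
import copy


def deep_merge(base, override):
    """Return a new dict with `override` recursively merged over `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def pick(node, index):
    """Select the index-th element of every list leaf; scalars are shared."""
    if isinstance(node, dict):
        return {key: pick(value, index) for key, value in node.items()}
    if isinstance(node, list):
        return node[index]
    return node


def zip_length(node):
    """Return the common length of all list leaves (assumed to be equal)."""
    lengths = set()

    def walk(current):
        if isinstance(current, dict):
            for value in current.values():
                walk(value)
        elif isinstance(current, list):
            lengths.add(len(current))

    walk(node)
    if len(lengths) != 1:
        raise ValueError(f"zip lists must share one length, got {sorted(lengths)}")
    return lengths.pop()


def expand_zip_override(base, zip_override):
    """Produce one merged config per zipped position."""
    count = zip_length(zip_override)
    return [deep_merge(base, pick(zip_override, i)) for i in range(count)]


# Trimmed-down illustrative data, not the full recipe.
base = {
    "name": "b200-fp8-stp-8k1k",
    "resources": {"prefill_nodes": 1, "decode_nodes": 1, "decode_workers": 1},
}
zip_override_stp_lowlat = {
    "name": [
        "b200-fp8-stp-low-latency-tep8-1p-1d",
        "b200-fp8-stp-low-latency-tep8-1p-4d",
        "b200-fp8-stp-low-latency-tep8-1p-6d",
    ],
    "resources": {"decode_nodes": [1, 4, 6], "decode_workers": [1, 4, 6]},
    "benchmark": {"concurrencies": ["4x32x64", "64", "32"]},
}

variants = expand_zip_override(base, zip_override_stp_lowlat)
for variant in variants:
    print(variant["name"], variant["resources"], variant["benchmark"])
```

Under these assumptions, one three-element zip block stands in for three of the former per-variant files, and keys absent from the override (such as `prefill_nodes`) are inherited from the base.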

Test plan

  • make check passes (336 tests, lint clean)
  • Diff script confirms all 40 variants are semantically equivalent to original files
  • srtctl dry-run -f recipes/b200-fp8/1k1k.yaml previews expected configs
  • srtctl dry-run -f recipes/b200-fp4/8k1k.yaml previews expected configs
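
The diff script itself is not shown in this PR. As a rough sketch of what a semantic-equivalence check could look like, the following compares two already-loaded config dicts recursively and ignores the top-level `name` field, as the commit notes describe. The function names `diff_configs` and `equivalence_report` are hypothetical.

```python
def diff_configs(old, new, path=""):
    """Return a list of human-readable differences between two nested configs."""
    diffs = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            here = f"{path}.{key}" if path else str(key)
            if key not in old:
                diffs.append(f"added: {here}")
            elif key not in new:
                diffs.append(f"removed: {here}")
            else:
                diffs.extend(diff_configs(old[key], new[key], here))
    elif old != new:
        diffs.append(f"changed: {path}: {old!r} -> {new!r}")
    return diffs


def equivalence_report(original, expanded, ignore=("name",)):
    """Diff two configs after dropping ignored top-level keys."""
    def strip(config):
        return {k: v for k, v in config.items() if k not in ignore}

    return diff_configs(strip(original), strip(expanded))


# Illustrative data: an original per-variant file vs. an expanded variant.
original = {"name": "old-name", "resources": {"decode_nodes": 4}}
same = {"name": "new-name", "resources": {"decode_nodes": 4}}
drifted = {"name": "new-name", "resources": {"decode_nodes": 6}}

print(equivalence_report(original, same))
print(equivalence_report(original, drifted))
```

An empty report means the two configs agree on everything except the ignored keys.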

Summary by CodeRabbit

Release Notes

  • New Features

    • Added B200-FP4 and B200-FP8 inference configurations for 1k1k and 8k1k model deployments
    • Introduced multiple deployment variants: low-latency and max-throughput profiles for both standard and speculative token prediction modes
    • Enabled speculative decoding (EAGLE) support in multi-token prediction configurations
  • Chores

    • Consolidated individual configuration files into unified variant-based deployment templates

@coderabbitai
Contributor

coderabbitai bot commented Mar 4, 2026

Warning

Rate limit exceeded

@weireweire has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 14 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a48fb719-efcb-49f3-a040-3874181d5da8

📥 Commits

Reviewing files that changed from the base of the PR and between ac84b66 and 42af38e.

📒 Files selected for processing (45)
  • recipes/b200-fp4/1k1k.yaml
  • recipes/b200-fp4/1k1k/mtp/low-latency-dep4-1p-tep8-5d.yaml
  • recipes/b200-fp4/1k1k/mtp/low-latency-dep4-1p-tep8-6d.yaml
  • recipes/b200-fp4/1k1k/mtp/max-tpt-dep4-1p-dep8-1d.yaml
  • recipes/b200-fp4/1k1k/mtp/max-tpt-dep4-1p-dep8-2d.yaml
  • recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-5d.yaml
  • recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml
  • recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-1d.yaml
  • recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-2d.yaml
  • recipes/b200-fp4/8k1k.yaml
  • recipes/b200-fp4/8k1k/mtp/low-latency-dep4-1p-tep8-1d.yaml
  • recipes/b200-fp4/8k1k/mtp/low-latency-dep4-1p-tep8-5d.yaml
  • recipes/b200-fp4/8k1k/mtp/low-latency-dep4-2p-tep8-5d.yaml
  • recipes/b200-fp4/8k1k/mtp/low-latency-tp4-1p-tp8-1d.yaml
  • recipes/b200-fp4/8k1k/mtp/max-tpt-dep4-4p-dep8-1d.yaml
  • recipes/b200-fp4/8k1k/mtp/max-tpt-dep4-7p-dep8-2d.yaml
  • recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml
  • recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-5d.yaml
  • recipes/b200-fp4/8k1k/stp/low-latency-dep4-2p-tep8-5d.yaml
  • recipes/b200-fp4/8k1k/stp/low-latency-tp4-1p-tp8-1d.yaml
  • recipes/b200-fp4/8k1k/stp/max-tpt-dep4-7p-dep8-2d.yaml
  • recipes/b200-fp8/1k1k.yaml
  • recipes/b200-fp8/1k1k/mtp/low-latency-tep8-1p1d.yaml
  • recipes/b200-fp8/1k1k/mtp/low-latency-tep8-1p3d.yaml
  • recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p1d.yaml
  • recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p2d.yaml
  • recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p5d.yaml
  • recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-2p5d.yaml
  • recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p1d.yaml
  • recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml
  • recipes/b200-fp8/1k1k/stp/max-tpt-dep8-1p5d.yaml
  • recipes/b200-fp8/1k1k/stp/max-tpt-dep8-2p5d.yaml
  • recipes/b200-fp8/8k1k.yaml
  • recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p1d.yaml
  • recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p4d.yaml
  • recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p6d.yaml
  • recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-1p1d.yaml
  • recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-1p2d.yaml
  • recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-2p1d.yaml
  • recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p1d.yaml
  • recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p4d.yaml
  • recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p6d.yaml
  • recipes/b200-fp8/8k1k/stp/max-tpt-dep8-1p1d.yaml
  • recipes/b200-fp8/8k1k/stp/max-tpt-dep8-2p1d.yaml
  • tests/test_override.py
📝 Walkthrough

This PR consolidates B200-FP4 and B200-FP8 deployment recipes by centralizing multiple variant configurations into unified YAML files with override keys. Four comprehensive recipe files are added (1k1k and 8k1k for each precision), replacing numerous individual variant files covering STP/MTP inference modes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| B200-FP4 Consolidated Recipes: recipes/b200-fp4/1k1k.yaml, recipes/b200-fp4/8k1k.yaml | Adds two comprehensive YAML configurations with base settings and multiple variant overrides (zip_override_stp_lowlat, zip_override_mtp_lowlat, zip_override_stp_maxtpt, zip_override_mtp_maxtpt, etc.) for STP/MTP inference modes, disaggregation patterns, and throughput/latency profiles. Centralizes model, resources, backend, sglang_config, health_check, and benchmark definitions. |
| B200-FP4 Individual Variants (Deletions): recipes/b200-fp4/1k1k/{mtp,stp}/*.yaml, recipes/b200-fp4/8k1k/{mtp,stp}/*.yaml | Removes individual variant configuration files (low-latency and max-tpt profiles with various node/worker/decode scales). Content consolidated into parent 1k1k.yaml and 8k1k.yaml files. 19 files deleted. |
| B200-FP8 Consolidated Recipes: recipes/b200-fp8/1k1k.yaml, recipes/b200-fp8/8k1k.yaml | Adds two comprehensive FP8 YAML configurations with base settings and variant overrides (zip_override_stp_lowlat, zip_override_mtp_lowlat, zip_override_stp_maxtpt, zip_override_mtp_maxtpt, etc.) mirroring FP4 structure with FP8-specific tuning and resource allocations. |
| B200-FP8 Individual Variants (Deletions): recipes/b200-fp8/1k1k/{mtp,stp}/*.yaml, recipes/b200-fp8/8k1k/{mtp,stp}/*.yaml | Removes individual FP8 variant configuration files across low-latency and max-tpt profiles. Content consolidated into parent 1k1k.yaml and 8k1k.yaml files. 21 files deleted. |
| Test Updates: tests/test_override.py | Updates comment text in test file to clarify override auto-naming behavior (minor documentation change). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Update MTP recipe with multiple draft steps #78 — Updates MTP speculative decoding settings in deployment recipes, including SGLANG_ENABLE_SPEC_V2 and draft token configurations that align with variant definitions in this PR.
  • Update gb200 recipes #130 — Modifies deployment recipe YAMLs with SGLANG/backend keys, disaggregation-transfer-backend, and model container updates that share pattern similarities with this consolidation.

Suggested reviewers

  • ishandhanani
  • gracehonv

Poem

🐰 Four recipes now dance in unified grace,
Where variants fold in a consolidated space,
From scattered configs to YAML's neat fold,
B200 spins faster, both FP4 and bold—
Override keys orchestrate latency's race! 🚀

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Title check | ✅ Passed | The title accurately summarizes the main objective: consolidating 40 individual per-variant recipe files into 4 combined files using override blocks. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/b200-fp4/8k1k.yaml`:
- Line 10: The usage comment claiming "all 12 variants" is incorrect; there are
11 non-base variants in this recipe. Update the inline comment on the top-line
usage (the "srtctl apply -f recipes/b200-fp4/8k1k.yaml # all 12 variants"
string) to the correct count (e.g., "all 11 variants"), or alternatively
add/remove variant entries so the number matches; verify by counting the variant
definitions in this file (the entries that define variant names) and make the
comment consistent with the actual variants.

In `@recipes/b200-fp8/8k1k.yaml`:
- Around line 17-253: CI schema validation is failing because this recipe uses a
base + override pattern (the top-level "base" block and override groups like
"zip_override_stp_lowlat", "zip_override_mtp_lowlat", "zip_override_stp_maxtpt",
"zip_override_mtp_maxtpt") but the validator treats the file as a single
concrete config; update the validator to detect files that define a "base" plus
"override" or "zip_override_*" keys and validate by expanding the base with each
override variant (or validating the base and each merged variant) rather than
validating the raw file shape, ensuring required fields in the expanded configs
are present and unknown-field errors are reported against the merged variants
instead of the override wrapper.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 920738ea-191a-49e8-8d0f-438a31784b84

📥 Commits

Reviewing files that changed from the base of the PR and between 3053690 and ac84b66.

💤 Files with no reviewable changes (40)
  • recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p4d.yaml
  • recipes/b200-fp8/1k1k/mtp/low-latency-tep8-1p3d.yaml
  • recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-2d.yaml
  • recipes/b200-fp4/1k1k/mtp/low-latency-dep4-1p-tep8-6d.yaml
  • recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p6d.yaml
  • recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-1p2d.yaml
  • recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-2p1d.yaml
  • recipes/b200-fp4/1k1k/mtp/low-latency-dep4-1p-tep8-5d.yaml
  • recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml
  • recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-2p5d.yaml
  • recipes/b200-fp4/8k1k/mtp/low-latency-dep4-1p-tep8-1d.yaml
  • recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p1d.yaml
  • recipes/b200-fp8/1k1k/mtp/low-latency-tep8-1p1d.yaml
  • recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p6d.yaml
  • recipes/b200-fp8/1k1k/stp/max-tpt-dep8-2p5d.yaml
  • recipes/b200-fp4/1k1k/mtp/max-tpt-dep4-1p-dep8-2d.yaml
  • recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-5d.yaml
  • recipes/b200-fp4/8k1k/mtp/low-latency-dep4-1p-tep8-5d.yaml
  • recipes/b200-fp8/1k1k/stp/max-tpt-dep8-1p5d.yaml
  • recipes/b200-fp4/1k1k/mtp/max-tpt-dep4-1p-dep8-1d.yaml
  • recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p1d.yaml
  • recipes/b200-fp4/8k1k/mtp/max-tpt-dep4-4p-dep8-1d.yaml
  • recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p1d.yaml
  • recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml
  • recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-1p1d.yaml
  • recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p2d.yaml
  • recipes/b200-fp4/8k1k/mtp/low-latency-tp4-1p-tp8-1d.yaml
  • recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-5d.yaml
  • recipes/b200-fp8/8k1k/stp/max-tpt-dep8-2p1d.yaml
  • recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-1d.yaml
  • recipes/b200-fp4/8k1k/stp/max-tpt-dep4-7p-dep8-2d.yaml
  • recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml
  • recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p4d.yaml
  • recipes/b200-fp4/8k1k/mtp/low-latency-dep4-2p-tep8-5d.yaml
  • recipes/b200-fp4/8k1k/stp/low-latency-tp4-1p-tp8-1d.yaml
  • recipes/b200-fp8/8k1k/stp/max-tpt-dep8-1p1d.yaml
  • recipes/b200-fp4/8k1k/stp/low-latency-dep4-2p-tep8-5d.yaml
  • recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p5d.yaml
  • recipes/b200-fp4/8k1k/mtp/max-tpt-dep4-7p-dep8-2d.yaml
  • recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p1d.yaml

Comment thread recipes/b200-fp4/8k1k.yaml Outdated
# override_mtp_maxtpt_4p1d: MTP-only 4p1d, no frontends, env-var FP4 backend
#
# Usage:
# srtctl apply -f recipes/b200-fp4/8k1k.yaml # all 12 variants

⚠️ Potential issue | 🟡 Minor

Usage comment variant count appears off by one.

The file currently defines 11 non-base variants, not 12.

Suggested correction
-#   srtctl apply  -f recipes/b200-fp4/8k1k.yaml                              # all 12 variants
+#   srtctl apply  -f recipes/b200-fp4/8k1k.yaml                              # all 11 variants
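
One way to double-check a variant count like this is to tally it from the loaded recipe rather than by eye. The sketch below assumes each `zip_override_*` block contributes one variant per zipped list position and each plain `override_*` block contributes one; the `base` block is not counted. The helper names and the toy recipe are hypothetical and do not reproduce the actual file.

```python
def zip_length(node):
    """Return the length of the first list leaf found, or 0 if there is none."""
    if isinstance(node, list):
        return len(node)
    if isinstance(node, dict):
        for value in node.values():
            length = zip_length(value)
            if length:
                return length
    return 0


def count_variants(recipe):
    """Count non-base variants defined by override blocks in a loaded recipe."""
    total = 0
    for key, block in recipe.items():
        if key.startswith("zip_override_"):
            total += zip_length(block)
        elif key.startswith("override_"):
            total += 1
    return total


# Toy recipe for illustration only.
toy_recipe = {
    "base": {"name": "toy"},
    "zip_override_a": {"name": ["a-1", "a-2", "a-3"]},
    "zip_override_b": {"resources": {"decode_nodes": [1, 2]}},
    "override_c": {"resources": {"prefill_nodes": 4}},
}

print(count_variants(toy_recipe))
```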

Comment on lines +17 to +253
base:
  name: "b200-fp8-stp-8k1k"

  model:
    path: "dsr1-fp8"
    container: "dynamo-sglang"
    precision: "fp8"

  resources:
    gpu_type: "b200"
    prefill_nodes: 1
    prefill_workers: 1
    decode_nodes: 1
    decode_workers: 1
    gpus_per_node: 8

  backend:
    prefill_environment:
      TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
      PYTHONUNBUFFERED: "1"
      DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
      SGLANG_ENABLE_JIT_DEEPGEMM: "false"
      SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
      SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
      SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
      SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
      SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
      SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
      MC_FORCE_MNNVL: "1"
      NCCL_MNNVL_ENABLE: "1"
      NCCL_CUMEM_ENABLE: "1"
      DYN_REQUEST_PLANE: nats
    decode_environment:
      TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
      PYTHONUNBUFFERED: "1"
      DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
      SGLANG_ENABLE_JIT_DEEPGEMM: "false"
      SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
      SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
      SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
      SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
      SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
      SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
      SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
      MC_FORCE_MNNVL: "1"
      NCCL_MNNVL_ENABLE: "1"
      NCCL_CUMEM_ENABLE: "1"
      DYN_REQUEST_PLANE: nats
    sglang_config:
      prefill:
        # Model configuration
        served-model-name: "deepseek-ai/DeepSeek-R1"
        trust-remote-code: true
        quantization: "fp8"

        # Disaggregation mode
        disaggregation-mode: "prefill"
        disaggregation-transfer-backend: nixl

        # Memory and token limits
        mem-fraction-static: 0.85
        max-prefill-tokens: 32768
        chunked-prefill-size: 32768
        context-length: 9600
        max-running-requests: 512
        disable-cuda-graph: true

        # Parallelism
        tensor-parallel-size: 8
        data-parallel-size: 1
        expert-parallel-size: 8

        # Attention
        attention-backend: "trtllm_mla"
        kv-cache-dtype: "fp8_e4m3"

        # MoE
        moe-runner-backend: "flashinfer_trtllm"
        # moe-dense-tp-size: 1

        # Other flags
        stream-interval: 30
        watchdog-timeout: 1000000
        enable-flashinfer-allreduce-fusion: true
        disable-radix-cache: true

      decode:
        # Model configuration
        served-model-name: "deepseek-ai/DeepSeek-R1"
        trust-remote-code: true
        quantization: "fp8"

        # Disaggregation mode
        disaggregation-mode: "decode"
        disaggregation-transfer-backend: nixl

        # Memory and token limits
        mem-fraction-static: 0.85
        max-prefill-tokens: 32768
        chunked-prefill-size: 32768
        context-length: 9600
        max-running-requests: 512
        cuda-graph-max-bs: 512

        # Parallelism
        tensor-parallel-size: 8
        data-parallel-size: 1
        expert-parallel-size: 8

        # Attention
        attention-backend: "trtllm_mla"
        kv-cache-dtype: "fp8_e4m3"

        # MoE
        moe-runner-backend: "flashinfer_trtllm"
        # moe-dense-tp-size: 1

        # Other flags
        stream-interval: 30
        watchdog-timeout: 1000000
        enable-flashinfer-allreduce-fusion: true
        disable-radix-cache: true
        # disable-chunked-prefix-cache: true

  health_check:
    max_attempts: 360
    interval_seconds: 10

  benchmark:
    type: "sa-bench"
    isl: 8192
    osl: 1024
    req_rate: "inf"


# STP low-latency: tep8 decode (DP=1), scale sweep 1p1d/1p4d/1p6d
zip_override_stp_lowlat:
  name:
    - "b200-fp8-stp-low-latency-tep8-1p-1d"
    - "b200-fp8-stp-low-latency-tep8-1p-4d"
    - "b200-fp8-stp-low-latency-tep8-1p-6d"
  resources:
    decode_nodes: [1, 4, 6]
    decode_workers: [1, 4, 6]
  benchmark:
    concurrencies: ["4x32x64", "64", "32"]


# MTP low-latency: same scales as STP, adds EAGLE speculative decoding
zip_override_mtp_lowlat:
  name:
    - "b200-fp8-mtp-low-latency-tep8-1p-1d"
    - "b200-fp8-mtp-low-latency-tep8-1p-4d"
    - "b200-fp8-mtp-low-latency-tep8-1p-6d"
  resources:
    decode_nodes: [1, 4, 6]
    decode_workers: [1, 4, 6]
  backend:
    prefill_environment:
      SGLANG_ENABLE_SPEC_V2: "1"
    decode_environment:
      SGLANG_ENABLE_SPEC_V2: "1"
    sglang_config:
      prefill:
        moe-dense-tp-size: 1
      decode:
        speculative-algorithm: "EAGLE"
        speculative-num-steps: 2
        speculative-eagle-topk: 1
        speculative-num-draft-tokens: 3
  benchmark:
    concurrencies: ["16x32x64", "8x256", "4x8x16x256"]


# STP max-throughput: dep8 decode (DP=8), scale sweep 1p1d and 2p1d
zip_override_stp_maxtpt:
  name:
    - "b200-fp8-stp-max-tpt-dep8-1p-1d"
    - "b200-fp8-stp-max-tpt-dep8-2p-1d"
  resources:
    prefill_nodes: [1, 2]
    prefill_workers: [1, 2]
    decode_nodes: [1, 1]
    decode_workers: [1, 1]
  backend:
    sglang_config:
      prefill:
        data-parallel-size: 8
        enable-dp-attention: true
        enable-dp-lm-head: true
        moe-dense-tp-size: 1
        max-running-requests: 1024
      decode:
        data-parallel-size: 8
        enable-dp-attention: true
        enable-dp-lm-head: true
        moe-dense-tp-size: 1
        max-running-requests: 1024
        cuda-graph-max-bs: 1024
  benchmark:
    concurrencies: ["128", "256"]


# MTP max-throughput: dep8 decode, scale sweep 1p1d/1p2d/2p1d, adds EAGLE speculative decoding
# Note: max-running-requests stays at 512 for MTP (unlike STP which raises to 1024)
zip_override_mtp_maxtpt:
  name:
    - "b200-fp8-mtp-max-tpt-dep8-1p-1d"
    - "b200-fp8-mtp-max-tpt-dep8-1p-2d"
    - "b200-fp8-mtp-max-tpt-dep8-2p-1d"
  resources:
    prefill_nodes: [1, 1, 2]
    prefill_workers: [1, 1, 2]
    decode_nodes: [1, 2, 1]
    decode_workers: [1, 2, 1]
  backend:
    prefill_environment:
      SGLANG_ENABLE_SPEC_V2: "1"
    decode_environment:
      SGLANG_ENABLE_SPEC_V2: "1"
    sglang_config:
      prefill:
        data-parallel-size: 8
        enable-dp-attention: true
        enable-dp-lm-head: true
        moe-dense-tp-size: 1
      decode:
        data-parallel-size: 8
        enable-dp-attention: true
        enable-dp-lm-head: true
        moe-dense-tp-size: 1
        speculative-algorithm: "EAGLE"
        speculative-num-steps: 2
        speculative-eagle-topk: 1
        speculative-num-draft-tokens: 3
  benchmark:
    concurrencies: ["256", "128x256x512x1024", "128x512"]

⚠️ Potential issue | 🔴 Critical

Override-format recipe is currently blocked by CI schema validation.

This file uses base + zip_override_*/override_*, but CI is validating it as a single concrete config, which causes the reported missing required fields and unknown fields. Please make recipe validation override-aware (detect override configs and validate expanded variants) before merge.

I can help draft the validator-side patch if you want.
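
The failure mode and the suggested fix can be sketched as follows. This is only an illustration under stated assumptions: the field sets in `KNOWN_FIELDS` and `REQUIRED_FIELDS` are a made-up stand-in for the real schema, and all function names are hypothetical rather than taken from the project's validator.

```python
import copy

# Hypothetical schema: stand-ins for whatever the real validator enforces.
KNOWN_FIELDS = {"name", "model", "resources", "backend", "frontend", "health_check", "benchmark"}
REQUIRED_FIELDS = {"name", "model", "resources"}


def validate_concrete(config):
    """Validate a single concrete config; return a list of error strings."""
    errors = [f"missing required field: {f}" for f in sorted(REQUIRED_FIELDS - set(config))]
    errors += [f"unknown field: {f}" for f in sorted(set(config) - KNOWN_FIELDS)]
    return errors


def deep_merge(base, override):
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def pick(node, index):
    """Take the index-th element of every list leaf in a zip block."""
    if isinstance(node, dict):
        return {k: pick(v, index) for k, v in node.items()}
    return node[index] if isinstance(node, list) else node


def zip_length(node):
    """Length of the first list leaf found, or 0 if there is none."""
    if isinstance(node, list):
        return len(node)
    if isinstance(node, dict):
        for value in node.values():
            length = zip_length(value)
            if length:
                return length
    return 0


def expand(raw):
    """Yield (label, merged_config) for every override variant in the file."""
    base = raw["base"]
    for key, block in raw.items():
        if key.startswith("zip_override_"):
            for i in range(zip_length(block)):
                yield f"{key}[{i}]", deep_merge(base, pick(block, i))
        elif key.startswith("override_"):
            yield key, deep_merge(base, block)


def validate_recipe(raw):
    """Validate a plain config directly, or an override recipe per merged variant."""
    if "base" not in raw:
        return {"<file>": validate_concrete(raw)}
    report = {"base": validate_concrete(raw["base"])}
    for label, config in expand(raw):
        report[label] = validate_concrete(config)
    return report


raw = {
    "base": {"name": "demo", "model": {"path": "m"}, "resources": {"decode_nodes": 1}},
    "zip_override_scale": {"resources": {"decode_nodes": [1, 4]}},
    "override_typo": {"resorces": {"decode_nodes": 2}},
}
print(validate_concrete(raw))  # naive: the wrapper keys themselves are rejected
print(validate_recipe(raw))    # override-aware: errors attach to merged variants
```

Validating the raw file shape flags `base` and the override keys as unknown and reports every required field as missing, while the override-aware path attributes the only real problem (the misspelled key) to the variant that introduced it.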

@weireweire weireweire changed the title from "Consolidate B200 recipes: merge per-variant STP/MTP files into 4 combined files" to "Consolidate B200 recipes: merge 40 per-variant STP/MTP files into 4 combined files" Mar 4, 2026
weiliangl added 7 commits March 4, 2026 06:03
Reduce 40 individual recipe files to 8 override files (one per
precision × isl × stp/mtp combination).

Each file uses zip_override_scale to sweep all node-count variants,
eliminating per-variant YAML duplication. FP4 8k1k files additionally
use override_tp4 to cover the TP4 prefill mode alongside the default
dep4 variants.

Before: b200-fp8 (21 files) + b200-fp4 (19 files) = 40 files
After:  8 override files covering all the same variants

  recipes/b200-fp8/1k1k-stp.yaml  (4 variants: 1p1d/1p3d low-lat, 1p5d/2p5d max-tpt)
  recipes/b200-fp8/1k1k-mtp.yaml  (6 variants)
  recipes/b200-fp8/8k1k-stp.yaml  (5 variants: 1p1d/1p4d/1p6d low-lat, 1p1d/2p1d max-tpt)
  recipes/b200-fp8/8k1k-mtp.yaml  (6 variants)
  recipes/b200-fp4/1k1k-stp.yaml  (4 variants: 1p5d/1p6d low-lat, 1p1d/1p2d max-tpt)
  recipes/b200-fp4/1k1k-mtp.yaml  (4 variants)
  recipes/b200-fp4/8k1k-stp.yaml  (5 dep4 variants + override_tp4)
  recipes/b200-fp4/8k1k-mtp.yaml  (5 dep4 variants + override_tp4)
…ions

Recipe fixes:
- Move num_additional_frontends from resources: to frontend: in FP4 8k1k
  files (was causing schema validation Unknown field error)
- Fix override_maxtpt_4p1d: use frontend: null to drop frontend config
  (original file has no frontend section)
- Fix override_tp4: remove erroneous fp4-gemm-backend: null (original
  tp4 file keeps flashinfer_trtllm backend), add decode expert-parallel-size: 1
- Separate low-lat and max-tpt into distinct zip_override_ groups so each
  carries appropriate sglang_config overrides (DP=8, moe-dense-tp-size, etc.)
- FP4 1k1k MTP max-tpt: add per-variant mem-fraction-static list [0.75, 0.85]
- FP8 MTP max-tpt: keep max-running-requests=512 (STP raises to 1024, MTP does not)
- FP8 1k1k MTP: add override_maxtpt_1p2d special case with spec-steps=1, draft-tokens=2

Core fix:
- generate_override_configs: respect explicit name: field in override_* dicts
  instead of always auto-generating {base_name}_{suffix}; add test coverage
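
The naming rule described in the core fix can be sketched roughly like this. It is not the code of `generate_override_configs`; the helper `variant_name` is hypothetical, and deriving the suffix by stripping the `override_` prefix from the key is an assumption.

```python
def variant_name(base_name, override_key, override):
    """Prefer an explicit name in the override; else fall back to {base_name}_{suffix}."""
    explicit = override.get("name")
    if explicit:
        return explicit
    prefix = "override_"
    suffix = override_key[len(prefix):] if override_key.startswith(prefix) else override_key
    return f"{base_name}_{suffix}"


# Auto-generated name when the override carries no name field.
print(variant_name("b200-fp4-8k1k", "override_tp4", {"resources": {"prefill_nodes": 1}}))
# Explicit name wins when present.
print(variant_name("b200-fp4-8k1k", "override_tp4", {"name": "custom-tp4-variant"}))
```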

Consolidate 8 separate *-stp.yaml / *-mtp.yaml files into 4 combined
files (b200-fp8/1k1k.yaml, b200-fp8/8k1k.yaml, b200-fp4/1k1k.yaml,
b200-fp4/8k1k.yaml).

Override key names include stp/mtp labels (zip_override_stp_lowlat,
zip_override_mtp_maxtpt, etc.) enabling wildcard selectors:
  srtctl apply -f recipes/b200-fp8/1k1k.yaml:*stp*  # all STP variants
  srtctl apply -f recipes/b200-fp8/1k1k.yaml:*mtp*  # all MTP variants
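
A minimal sketch of how such a `path:pattern` selector could be resolved, assuming the pattern is a shell-style glob matched against override key names. The helpers `split_selector` and `select_overrides` are hypothetical; srtctl's real parsing may differ (for example in how it treats paths that themselves contain a colon).

```python
from fnmatch import fnmatchcase


def split_selector(target):
    """Split 'path:pattern' into (path, pattern); no pattern means match all."""
    path, separator, pattern = target.partition(":")
    return path, (pattern if separator else "*")


def select_overrides(keys, pattern):
    """Return the override keys matching a shell-style glob, in file order."""
    return [key for key in keys if fnmatchcase(key, pattern)]


override_keys = [
    "zip_override_stp_lowlat",
    "zip_override_mtp_lowlat",
    "zip_override_stp_maxtpt",
    "zip_override_mtp_maxtpt",
]

path, pattern = split_selector("recipes/b200-fp8/1k1k.yaml:*stp*")
print(path, select_overrides(override_keys, pattern))
```

Embedding `stp` or `mtp` in every key name is what makes a single glob sufficient to pick one decoding mode across both the low-latency and max-throughput groups.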

FP4 8k1k 7p2d uses the null mechanism to combine STP and MTP into one
zip_override_maxtpt_7p2d section: null in STP slots is a no-op (keys
absent from base); values in MTP slots add SGLANG_ENABLE_SPEC_V2 and
speculative settings on top of the same resources/sglang config.
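
The null behavior described here can be read as: a null value removes the key if the base has it and does nothing if the key is absent. The sketch below implements that reading; it is an interpretation of the commit notes, not the tool's actual merge code, and `merge_with_null` is a hypothetical name.

```python
import copy


def merge_with_null(base, override):
    """Deep-merge `override` over `base`; a None value deletes the key if present."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if value is None:
            merged.pop(key, None)  # no-op when the key is absent from the base
        elif isinstance(value, dict):
            current = merged.get(key)
            merged[key] = merge_with_null(current if isinstance(current, dict) else {}, value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative base, trimmed to two sections.
base = {
    "frontend": {"num_additional_frontends": 2},
    "backend": {"decode_environment": {"PYTHONUNBUFFERED": "1"}},
}

# STP slot: null for a key the base does not define leaves the config unchanged.
stp = merge_with_null(base, {"backend": {"decode_environment": {"SGLANG_ENABLE_SPEC_V2": None}}})
# MTP slot: a real value adds the key on top of the same base.
mtp = merge_with_null(base, {"backend": {"decode_environment": {"SGLANG_ENABLE_SPEC_V2": "1"}}})
# Null for an existing section drops it entirely.
no_frontend = merge_with_null(base, {"frontend": None})

print(stp == base, mtp["backend"]["decode_environment"], "frontend" in no_frontend)
```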
…om old files

- Replace zip_override_maxtpt_7p2d (null-mechanism combined) with explicit
  override_stp_maxtpt_7p2d and override_mtp_maxtpt_7p2d in b200-fp4/8k1k.yaml
- Verified all 4 combined files produce configs identical to the original
  individual stp/mtp files (compared with Python deep-diff, excluding name field)
- Add scale-sweep notes and backend notes from old files as section comments

Adds parameter grouping comments (# Model configuration, # Disaggregation
mode, # Memory and token limits, # Parallelism, # Attention, # MoE,
# Other flags) to the base sglang_config blocks in all four combined
recipe files, matching the originals in the per-variant subdirectories.
Also preserves commented-out hints (# moe-dense-tp-size: 1,
# disable-chunked-prefix-cache: true) from the original files.

All 40 variants verified equivalent to originals via diff script.

The 40 individual stp/mtp YAML files under b200-fp8/1k1k/, b200-fp8/8k1k/,
b200-fp4/1k1k/, and b200-fp4/8k1k/ are now consolidated into 4 combined
recipe files (one per precision×isl). All variants verified equivalent via
diff script before deletion.
@weireweire weireweire force-pushed the simplify/b200-recipes branch from ac84b66 to 42af38e Compare March 4, 2026 06:04
@weireweire weireweire merged commit 37f8ca2 into main Mar 4, 2026
5 checks passed