Consolidate B200 recipes: merge 40 per-variant STP/MTP files into 4 combined files #206
weireweire merged 7 commits into main
Walkthrough
This PR consolidates B200-FP4 and B200-FP8 deployment recipes by centralizing multiple variant configurations into unified YAML files with override keys. Four comprehensive recipe files are added (1k1k and 8k1k for each precision), replacing numerous individual variant files covering STP/MTP inference modes.
Estimated code review effort: 3 (Moderate), ~25 minutes
Pre-merge checks: 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/b200-fp4/8k1k.yaml`:
- Line 10: The usage comment claiming "all 12 variants" is incorrect; there are
11 non-base variants in this recipe. Update the inline comment on the top-line
usage (the "srtctl apply -f recipes/b200-fp4/8k1k.yaml # all 12 variants"
string) to the correct count (e.g., "all 11 variants"), or alternatively
add/remove variant entries so the number matches; verify by counting the variant
definitions in this file (the entries that define variant names) and make the
comment consistent with the actual variants.
In `@recipes/b200-fp8/8k1k.yaml`:
- Around lines 17-253: CI schema validation is failing because this recipe uses a
base + override pattern (the top-level "base" block and override groups like
"zip_override_stp_lowlat", "zip_override_mtp_lowlat", "zip_override_stp_maxtpt",
"zip_override_mtp_maxtpt") but the validator treats the file as a single
concrete config; update the validator to detect files that define a "base" plus
"override" or "zip_override_*" keys and validate by expanding the base with each
override variant (or validating the base and each merged variant) rather than
validating the raw file shape, ensuring required fields in the expanded configs
are present and unknown-field errors are reported against the merged variants
instead of the override wrapper.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 920738ea-191a-49e8-8d0f-438a31784b84
📒 Files selected for processing (45)
- recipes/b200-fp4/1k1k.yaml
- recipes/b200-fp4/1k1k/mtp/low-latency-dep4-1p-tep8-5d.yaml
- recipes/b200-fp4/1k1k/mtp/low-latency-dep4-1p-tep8-6d.yaml
- recipes/b200-fp4/1k1k/mtp/max-tpt-dep4-1p-dep8-1d.yaml
- recipes/b200-fp4/1k1k/mtp/max-tpt-dep4-1p-dep8-2d.yaml
- recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-5d.yaml
- recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml
- recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-1d.yaml
- recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-2d.yaml
- recipes/b200-fp4/8k1k.yaml
- recipes/b200-fp4/8k1k/mtp/low-latency-dep4-1p-tep8-1d.yaml
- recipes/b200-fp4/8k1k/mtp/low-latency-dep4-1p-tep8-5d.yaml
- recipes/b200-fp4/8k1k/mtp/low-latency-dep4-2p-tep8-5d.yaml
- recipes/b200-fp4/8k1k/mtp/low-latency-tp4-1p-tp8-1d.yaml
- recipes/b200-fp4/8k1k/mtp/max-tpt-dep4-4p-dep8-1d.yaml
- recipes/b200-fp4/8k1k/mtp/max-tpt-dep4-7p-dep8-2d.yaml
- recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml
- recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-5d.yaml
- recipes/b200-fp4/8k1k/stp/low-latency-dep4-2p-tep8-5d.yaml
- recipes/b200-fp4/8k1k/stp/low-latency-tp4-1p-tp8-1d.yaml
- recipes/b200-fp4/8k1k/stp/max-tpt-dep4-7p-dep8-2d.yaml
- recipes/b200-fp8/1k1k.yaml
- recipes/b200-fp8/1k1k/mtp/low-latency-tep8-1p1d.yaml
- recipes/b200-fp8/1k1k/mtp/low-latency-tep8-1p3d.yaml
- recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p1d.yaml
- recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p2d.yaml
- recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p5d.yaml
- recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-2p5d.yaml
- recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p1d.yaml
- recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml
- recipes/b200-fp8/1k1k/stp/max-tpt-dep8-1p5d.yaml
- recipes/b200-fp8/1k1k/stp/max-tpt-dep8-2p5d.yaml
- recipes/b200-fp8/8k1k.yaml
- recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p1d.yaml
- recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p4d.yaml
- recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p6d.yaml
- recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-1p1d.yaml
- recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-1p2d.yaml
- recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-2p1d.yaml
- recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p1d.yaml
- recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p4d.yaml
- recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p6d.yaml
- recipes/b200-fp8/8k1k/stp/max-tpt-dep8-1p1d.yaml
- recipes/b200-fp8/8k1k/stp/max-tpt-dep8-2p1d.yaml
- tests/test_override.py
💤 Files with no reviewable changes (40)
- recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p4d.yaml
- recipes/b200-fp8/1k1k/mtp/low-latency-tep8-1p3d.yaml
- recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-2d.yaml
- recipes/b200-fp4/1k1k/mtp/low-latency-dep4-1p-tep8-6d.yaml
- recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p6d.yaml
- recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-1p2d.yaml
- recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-2p1d.yaml
- recipes/b200-fp4/1k1k/mtp/low-latency-dep4-1p-tep8-5d.yaml
- recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml
- recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-2p5d.yaml
- recipes/b200-fp4/8k1k/mtp/low-latency-dep4-1p-tep8-1d.yaml
- recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p1d.yaml
- recipes/b200-fp8/1k1k/mtp/low-latency-tep8-1p1d.yaml
- recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p6d.yaml
- recipes/b200-fp8/1k1k/stp/max-tpt-dep8-2p5d.yaml
- recipes/b200-fp4/1k1k/mtp/max-tpt-dep4-1p-dep8-2d.yaml
- recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-5d.yaml
- recipes/b200-fp4/8k1k/mtp/low-latency-dep4-1p-tep8-5d.yaml
- recipes/b200-fp8/1k1k/stp/max-tpt-dep8-1p5d.yaml
- recipes/b200-fp4/1k1k/mtp/max-tpt-dep4-1p-dep8-1d.yaml
- recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p1d.yaml
- recipes/b200-fp4/8k1k/mtp/max-tpt-dep4-4p-dep8-1d.yaml
- recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p1d.yaml
- recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml
- recipes/b200-fp8/8k1k/mtp/max-tpt-dep8-1p1d.yaml
- recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p2d.yaml
- recipes/b200-fp4/8k1k/mtp/low-latency-tp4-1p-tp8-1d.yaml
- recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-5d.yaml
- recipes/b200-fp8/8k1k/stp/max-tpt-dep8-2p1d.yaml
- recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-1d.yaml
- recipes/b200-fp4/8k1k/stp/max-tpt-dep4-7p-dep8-2d.yaml
- recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml
- recipes/b200-fp8/8k1k/mtp/low-latency-tep8-1p4d.yaml
- recipes/b200-fp4/8k1k/mtp/low-latency-dep4-2p-tep8-5d.yaml
- recipes/b200-fp4/8k1k/stp/low-latency-tp4-1p-tp8-1d.yaml
- recipes/b200-fp8/8k1k/stp/max-tpt-dep8-1p1d.yaml
- recipes/b200-fp4/8k1k/stp/low-latency-dep4-2p-tep8-5d.yaml
- recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p5d.yaml
- recipes/b200-fp4/8k1k/mtp/max-tpt-dep4-7p-dep8-2d.yaml
- recipes/b200-fp8/1k1k/mtp/max-tpt-dep8-1p1d.yaml
# override_mtp_maxtpt_4p1d: MTP-only 4p1d, no frontends, env-var FP4 backend
#
# Usage:
# srtctl apply -f recipes/b200-fp4/8k1k.yaml # all 12 variants
Usage comment variant count appears off by one.
The file currently defines 11 non-base variants, not 12.
Suggested correction:
-# srtctl apply -f recipes/b200-fp4/8k1k.yaml # all 12 variants
+# srtctl apply -f recipes/b200-fp4/8k1k.yaml # all 11 variants
base:
  name: "b200-fp8-stp-8k1k"

  model:
    path: "dsr1-fp8"
    container: "dynamo-sglang"
    precision: "fp8"

  resources:
    gpu_type: "b200"
    prefill_nodes: 1
    prefill_workers: 1
    decode_nodes: 1
    decode_workers: 1
    gpus_per_node: 8

  backend:
    prefill_environment:
      TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
      PYTHONUNBUFFERED: "1"
      DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
      SGLANG_ENABLE_JIT_DEEPGEMM: "false"
      SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
      SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
      SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
      SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
      SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
      SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
      MC_FORCE_MNNVL: "1"
      NCCL_MNNVL_ENABLE: "1"
      NCCL_CUMEM_ENABLE: "1"
      DYN_REQUEST_PLANE: nats
    decode_environment:
      TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
      PYTHONUNBUFFERED: "1"
      DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
      SGLANG_ENABLE_JIT_DEEPGEMM: "false"
      SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
      SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
      SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
      SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
      SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
      SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
      SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
      MC_FORCE_MNNVL: "1"
      NCCL_MNNVL_ENABLE: "1"
      NCCL_CUMEM_ENABLE: "1"
      DYN_REQUEST_PLANE: nats
    sglang_config:
      prefill:
        # Model configuration
        served-model-name: "deepseek-ai/DeepSeek-R1"
        trust-remote-code: true
        quantization: "fp8"

        # Disaggregation mode
        disaggregation-mode: "prefill"
        disaggregation-transfer-backend: nixl

        # Memory and token limits
        mem-fraction-static: 0.85
        max-prefill-tokens: 32768
        chunked-prefill-size: 32768
        context-length: 9600
        max-running-requests: 512
        disable-cuda-graph: true

        # Parallelism
        tensor-parallel-size: 8
        data-parallel-size: 1
        expert-parallel-size: 8

        # Attention
        attention-backend: "trtllm_mla"
        kv-cache-dtype: "fp8_e4m3"

        # MoE
        moe-runner-backend: "flashinfer_trtllm"
        # moe-dense-tp-size: 1

        # Other flags
        stream-interval: 30
        watchdog-timeout: 1000000
        enable-flashinfer-allreduce-fusion: true
        disable-radix-cache: true

      decode:
        # Model configuration
        served-model-name: "deepseek-ai/DeepSeek-R1"
        trust-remote-code: true
        quantization: "fp8"

        # Disaggregation mode
        disaggregation-mode: "decode"
        disaggregation-transfer-backend: nixl

        # Memory and token limits
        mem-fraction-static: 0.85
        max-prefill-tokens: 32768
        chunked-prefill-size: 32768
        context-length: 9600
        max-running-requests: 512
        cuda-graph-max-bs: 512

        # Parallelism
        tensor-parallel-size: 8
        data-parallel-size: 1
        expert-parallel-size: 8

        # Attention
        attention-backend: "trtllm_mla"
        kv-cache-dtype: "fp8_e4m3"

        # MoE
        moe-runner-backend: "flashinfer_trtllm"
        # moe-dense-tp-size: 1

        # Other flags
        stream-interval: 30
        watchdog-timeout: 1000000
        enable-flashinfer-allreduce-fusion: true
        disable-radix-cache: true
        # disable-chunked-prefix-cache: true

  health_check:
    max_attempts: 360
    interval_seconds: 10

  benchmark:
    type: "sa-bench"
    isl: 8192
    osl: 1024
    req_rate: "inf"

# STP low-latency: tep8 decode (DP=1), scale sweep 1p1d/1p4d/1p6d
zip_override_stp_lowlat:
  name:
    - "b200-fp8-stp-low-latency-tep8-1p-1d"
    - "b200-fp8-stp-low-latency-tep8-1p-4d"
    - "b200-fp8-stp-low-latency-tep8-1p-6d"
  resources:
    decode_nodes: [1, 4, 6]
    decode_workers: [1, 4, 6]
  benchmark:
    concurrencies: ["4x32x64", "64", "32"]

# MTP low-latency: same scales as STP, adds EAGLE speculative decoding
zip_override_mtp_lowlat:
  name:
    - "b200-fp8-mtp-low-latency-tep8-1p-1d"
    - "b200-fp8-mtp-low-latency-tep8-1p-4d"
    - "b200-fp8-mtp-low-latency-tep8-1p-6d"
  resources:
    decode_nodes: [1, 4, 6]
    decode_workers: [1, 4, 6]
  backend:
    prefill_environment:
      SGLANG_ENABLE_SPEC_V2: "1"
    decode_environment:
      SGLANG_ENABLE_SPEC_V2: "1"
    sglang_config:
      prefill:
        moe-dense-tp-size: 1
      decode:
        speculative-algorithm: "EAGLE"
        speculative-num-steps: 2
        speculative-eagle-topk: 1
        speculative-num-draft-tokens: 3
  benchmark:
    concurrencies: ["16x32x64", "8x256", "4x8x16x256"]

# STP max-throughput: dep8 decode (DP=8), scale sweep 1p1d and 2p1d
zip_override_stp_maxtpt:
  name:
    - "b200-fp8-stp-max-tpt-dep8-1p-1d"
    - "b200-fp8-stp-max-tpt-dep8-2p-1d"
  resources:
    prefill_nodes: [1, 2]
    prefill_workers: [1, 2]
    decode_nodes: [1, 1]
    decode_workers: [1, 1]
  backend:
    sglang_config:
      prefill:
        data-parallel-size: 8
        enable-dp-attention: true
        enable-dp-lm-head: true
        moe-dense-tp-size: 1
        max-running-requests: 1024
      decode:
        data-parallel-size: 8
        enable-dp-attention: true
        enable-dp-lm-head: true
        moe-dense-tp-size: 1
        max-running-requests: 1024
        cuda-graph-max-bs: 1024
  benchmark:
    concurrencies: ["128", "256"]

# MTP max-throughput: dep8 decode, scale sweep 1p1d/1p2d/2p1d, adds EAGLE speculative decoding
# Note: max-running-requests stays at 512 for MTP (unlike STP which raises to 1024)
zip_override_mtp_maxtpt:
  name:
    - "b200-fp8-mtp-max-tpt-dep8-1p-1d"
    - "b200-fp8-mtp-max-tpt-dep8-1p-2d"
    - "b200-fp8-mtp-max-tpt-dep8-2p-1d"
  resources:
    prefill_nodes: [1, 1, 2]
    prefill_workers: [1, 1, 2]
    decode_nodes: [1, 2, 1]
    decode_workers: [1, 2, 1]
  backend:
    prefill_environment:
      SGLANG_ENABLE_SPEC_V2: "1"
    decode_environment:
      SGLANG_ENABLE_SPEC_V2: "1"
    sglang_config:
      prefill:
        data-parallel-size: 8
        enable-dp-attention: true
        enable-dp-lm-head: true
        moe-dense-tp-size: 1
      decode:
        data-parallel-size: 8
        enable-dp-attention: true
        enable-dp-lm-head: true
        moe-dense-tp-size: 1
        speculative-algorithm: "EAGLE"
        speculative-num-steps: 2
        speculative-eagle-topk: 1
        speculative-num-draft-tokens: 3
  benchmark:
    concurrencies: ["256", "128x256x512x1024", "128x512"]
Override-format recipe is currently blocked by CI schema validation.
This file uses base + zip_override_*/override_*, but CI is validating it as a single concrete config, which causes the reported missing required fields and unknown fields. Please make recipe validation override-aware (detect override configs and validate expanded variants) before merge.
I can help draft the validator-side patch if you want.
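A minimal sketch of what override-aware validation could look like, assuming the expansion rules described in this PR: each `override_*` block is deep-merged onto `base`, each `zip_override_*` block is expanded index by index, and a `null` value drops a key. The names `deep_merge`, `pick`, `expand`, `validate`, and the `REQUIRED` tuple are hypothetical and are not taken from srtctl. Treating every list leaf inside a `zip_override_*` block as a per-variant sweep is also an assumption.

```python
import copy

REQUIRED = ("name", "model", "resources")  # hypothetical required top-level fields


def deep_merge(base, override):
    """Return a new dict with override merged onto base (assumed: None drops a key)."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if value is None:
            result.pop(key, None)
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def list_lengths(node):
    """Collect the lengths of all list leaves in a nested dict."""
    if isinstance(node, dict):
        return set().union(*(list_lengths(v) for v in node.values()))
    if isinstance(node, list):
        return {len(node)}
    return set()


def pick(node, index):
    """Take the index-th element of every list leaf; scalars pass through unchanged."""
    if isinstance(node, dict):
        return {k: pick(v, index) for k, v in node.items()}
    if isinstance(node, list):
        return node[index]
    return node


def expand(recipe):
    """Expand base + override blocks into a list of concrete variant configs."""
    base = recipe["base"]
    variants = []
    for key, block in recipe.items():
        if key.startswith("zip_override_"):
            lengths = list_lengths(block)
            if len(lengths) > 1:
                raise ValueError(f"{key}: list lengths differ: {sorted(lengths)}")
            count = lengths.pop() if lengths else 1
            variants += [deep_merge(base, pick(block, i)) for i in range(count)]
        elif key.startswith("override_"):
            variants.append(deep_merge(base, block))
    return variants


def validate(recipe):
    """Map each expanded variant name to its missing required fields."""
    return {
        v.get("name"): [field for field in REQUIRED if field not in v]
        for v in expand(recipe)
    }
```

With this shape, schema errors are reported per merged variant rather than against the raw `base`/`zip_override_*` wrapper, which is the behaviour the comment above asks for.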
Reduce 40 individual recipe files to 8 override files (one per precision × isl × stp/mtp combination). Each file uses zip_override_scale to sweep all node-count variants, eliminating per-variant YAML duplication. FP4 8k1k files additionally use override_tp4 to cover the TP4 prefill mode alongside the default dep4 variants.
Before: b200-fp8 (21 files) + b200-fp4 (19 files) = 40 files
After: 8 override files covering all the same variants
- recipes/b200-fp8/1k1k-stp.yaml (4 variants: 1p1d/1p3d low-lat, 1p5d/2p5d max-tpt)
- recipes/b200-fp8/1k1k-mtp.yaml (6 variants)
- recipes/b200-fp8/8k1k-stp.yaml (5 variants: 1p1d/1p4d/1p6d low-lat, 1p1d/2p1d max-tpt)
- recipes/b200-fp8/8k1k-mtp.yaml (6 variants)
- recipes/b200-fp4/1k1k-stp.yaml (4 variants: 1p5d/1p6d low-lat, 1p1d/1p2d max-tpt)
- recipes/b200-fp4/1k1k-mtp.yaml (4 variants)
- recipes/b200-fp4/8k1k-stp.yaml (5 dep4 variants + override_tp4)
- recipes/b200-fp4/8k1k-mtp.yaml (5 dep4 variants + override_tp4)
…ions
Recipe fixes:
- Move num_additional_frontends from resources: to frontend: in FP4 8k1k
files (was causing schema validation Unknown field error)
- Fix override_maxtpt_4p1d: use frontend: null to drop frontend config
(original file has no frontend section)
- Fix override_tp4: remove erroneous fp4-gemm-backend: null (original
tp4 file keeps flashinfer_trtllm backend), add decode expert-parallel-size: 1
- Separate low-lat and max-tpt into distinct zip_override_ groups so each
carries appropriate sglang_config overrides (DP=8, moe-dense-tp-size, etc.)
- FP4 1k1k MTP max-tpt: add per-variant mem-fraction-static list [0.75, 0.85]
- FP8 MTP max-tpt: keep max-running-requests=512 (STP raises to 1024, MTP does not)
- FP8 1k1k MTP: add override_maxtpt_1p2d special case with spec-steps=1, draft-tokens=2
Core fix:
- generate_override_configs: respect explicit name: field in override_* dicts
instead of always auto-generating {base_name}_{suffix}; add test coverage
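The naming rule in the core fix can be sketched as follows. This is an illustration only: `variant_name` is a hypothetical helper, not the actual `generate_override_configs` code, and deriving the suffix by stripping the `override_` prefix from the key is an assumption.

```python
def variant_name(base_name, override_key, override):
    """Pick the name for one override_* variant.

    An explicit `name:` in the override dict wins; otherwise fall back to
    the auto-generated {base_name}_{suffix} form.
    """
    explicit = override.get("name")
    if explicit:
        return explicit
    prefix = "override_"
    # Assumed: the suffix is the override key without its "override_" prefix.
    if override_key.startswith(prefix):
        suffix = override_key[len(prefix):]
    else:
        suffix = override_key
    return f"{base_name}_{suffix}"
```

Letting the explicit name win is what allows the combined files to reproduce the names of the original per-variant files.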
Consolidate 8 separate *-stp.yaml / *-mtp.yaml files into 4 combined files (b200-fp8/1k1k.yaml, b200-fp8/8k1k.yaml, b200-fp4/1k1k.yaml, b200-fp4/8k1k.yaml). Override key names include stp/mtp labels (zip_override_stp_lowlat, zip_override_mtp_maxtpt, etc.) enabling wildcard selectors:
srtctl apply -f recipes/b200-fp8/1k1k.yaml:*stp* # all STP variants
srtctl apply -f recipes/b200-fp8/1k1k.yaml:*mtp* # all MTP variants
FP4 8k1k 7p2d uses the null mechanism to combine STP and MTP into one zip_override_maxtpt_7p2d section: null in STP slots is a no-op (keys absent from base); values in MTP slots add SGLANG_ENABLE_SPEC_V2 and speculative settings on top of the same resources/sglang config.
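A rough sketch of how a `path:pattern` selector such as `1k1k.yaml:*stp*` could be resolved against override key names, using Python's standard `fnmatch` module. The helpers `split_target` and `select_overrides` are hypothetical, and srtctl's real selector parsing may differ (for example in how it treats a missing pattern).

```python
from fnmatch import fnmatchcase


def split_target(target):
    """Split 'recipe.yaml:pattern' into (path, pattern); no pattern means match all."""
    path, sep, pattern = target.partition(":")
    return path, (pattern if sep else "*")


def select_overrides(keys, pattern):
    """Return the override keys matching a shell-style wildcard, in original order."""
    return [key for key in keys if fnmatchcase(key, pattern)]
```

Because the stp/mtp label is part of each key, `*stp*` selects both the low-latency and the max-throughput STP groups without listing them individually.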
…om old files
- Replace zip_override_maxtpt_7p2d (null-mechanism combined) with explicit override_stp_maxtpt_7p2d and override_mtp_maxtpt_7p2d in b200-fp4/8k1k.yaml
- Verified all 4 combined files produce configs identical to the original individual stp/mtp files (compared with Python deep-diff, excluding name field)
- Add scale-sweep notes and backend notes from old files as section comments
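The equivalence check mentioned above might look roughly like this. It is a hand-rolled stand-in written for illustration, not the author's diff script and not the DeepDiff library; `diff_configs` is a hypothetical name, and ignoring only the top-level `name` key is an assumption.

```python
def diff_configs(old, new, ignore=("name",), path=""):
    """Return human-readable differences between two nested configs.

    Keys listed in `ignore` are skipped at the top level only.
    """
    diffs = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new), key=str):
            if not path and key in ignore:
                continue
            child = f"{path}.{key}" if path else str(key)
            if key not in old:
                diffs.append(f"{child}: only in new")
            elif key not in new:
                diffs.append(f"{child}: only in old")
            else:
                diffs.extend(diff_configs(old[key], new[key], ignore, child))
    elif old != new:
        diffs.append(f"{path}: {old!r} != {new!r}")
    return diffs
```

An empty result for every original file paired with its expanded variant would correspond to the "identical, excluding name field" claim.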
Adds parameter grouping comments (# Model configuration, # Disaggregation mode, # Memory and token limits, # Parallelism, # Attention, # MoE, # Other flags) to the base sglang_config blocks in all four combined recipe files, matching the originals in the per-variant subdirectories. Also preserves commented-out hints (# moe-dense-tp-size: 1, # disable-chunked-prefix-cache: true) from the original files. All 40 variants verified equivalent to originals via diff script.
The 40 individual stp/mtp YAML files under b200-fp8/1k1k/, b200-fp8/8k1k/, b200-fp4/1k1k/, and b200-fp4/8k1k/ are now consolidated into 4 combined recipe files (one per precision×isl). All variants verified equivalent via diff script before deletion.
ac84b66 to 42af38e
Summary
- Consolidates the 40 individual recipe files (b200-fp8/1k1k/stp|mtp/, b200-fp8/8k1k/stp|mtp/, b200-fp4/1k1k/stp|mtp/, b200-fp4/8k1k/stp|mtp/) into 4 combined recipe files, one per precision×isl
- Uses base plus zip_override_*/override_* blocks to express STP and MTP variants with minimal duplication
- Keeps the parameter grouping comments (# Model configuration, # Disaggregation mode, # Memory and token limits, # Parallelism, # Attention, # MoE, # Other flags) matching the originals
Test plan
- make check passes (336 tests, lint clean)
- srtctl dry-run -f recipes/b200-fp8/1k1k.yaml previews expected configs
- srtctl dry-run -f recipes/b200-fp4/8k1k.yaml previews expected configs