Add B200 FP8/FP4 STP configs by elvischenv · Pull Request #162 · ishandhanani/srt-slurm

elvischenv · 2026-02-07T01:54:22Z

Summary by CodeRabbit

New Features
- Added a broad set of deployment configuration templates for FP4 and FP8 inference, covering low‑latency and max‑throughput scenarios.
- Templates include varied prefill/decode topologies (single/multi‑node, multi‑GPU), resource sizing, health checks, and benchmark presets for performance testing.

coderabbitai · 2026-02-07T01:54:58Z

📝 Walkthrough

Adds multiple new YAML deployment recipes for B200 GPU inference using SGLang, covering FP4 and FP8 precisions, two sequence-length profiles (1k1k, 8k1k), and two deployment strategies (low-latency, max-throughput), each with detailed resource, environment, sglang_config, health_check, and benchmark settings.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| FP4 1k1k Low-Latency `recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-5d.yaml`, `recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml` | New low-latency FP4 deployment recipes: model, resources, prefill/decode environments, detailed `sglang_config` (prefill/decode), health_check, and sa-bench benchmark settings. |
| FP4 1k1k Max-Throughput `recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-1d.yaml`, `recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-2d.yaml` | New max-throughput FP4 recipes with tuned parallelism, environment vars (NCCL/MNNVL/SGLANG), `sglang_config` for prefill/decode, health_check, and benchmark sections. |
| FP4 8k1k Low-Latency `recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml`, `recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-5d.yaml`, `recipes/b200-fp4/8k1k/stp/low-latency-dep4-2p-tep8-5d.yaml` | Adds 8k1k FP4 low-latency recipes with Dynamo/frontend flags, extensive prefill/decode envs, MoE and FP4-specific backend flags, and benchmark configs. |
| FP4 8k1k Max-Throughput `recipes/b200-fp4/8k1k/stp/max-tpt-dep4-4p-dep8-1d.yaml`, `recipes/b200-fp4/8k1k/stp/max-tpt-dep4-7p-dep8-2d.yaml` | New high-parallelism FP4 max-throughput recipes covering resource allocations, env tuning, and `sglang_config` for prefill/decode. |
| FP8 1k1k Low-Latency `recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p1d.yaml`, `recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml` | New FP8 low-latency recipes with prefill/decode envs, attention backend and kv-cache dtype settings, `sglang_config`, health_check, and benchmarks. |
| FP8 1k1k Max-Throughput `recipes/b200-fp8/1k1k/stp/max-tpt-dep8-1p5d.yaml`, `recipes/b200-fp8/1k1k/stp/max-tpt-dep8-2p5d.yaml` | Adds FP8 max-throughput configs with multi-node prefill/decode resource layouts, memory pool/env tuning, and `sglang_config` parallelism/MoE settings. |
| FP8 8k1k Low-Latency `recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p1d.yaml`, `recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p4d.yaml`, `recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p6d.yaml` | New FP8 8k1k low-latency recipes with disaggregation prefill/decode configs, trtllm_mla attention settings, MoE flags, and benchmarks. |
| FP8 8k1k Max-Throughput `recipes/b200-fp8/8k1k/stp/max-tpt-dep8-1p1d.yaml`, `recipes/b200-fp8/8k1k/stp/max-tpt-dep8-2p1d.yaml` | Adds FP8 max-throughput recipes with env/parallelism tuning, `sglang_config` for both stages, health_check, and benchmark entries. |
| FP4 TP4/TP8 Mixed `recipes/b200-fp4/8k1k/stp/low-latency-tp4-1p-tp8-1d.yaml` | New mixed TP4 (prefill) / TP8 (decode) FP4 low-latency recipe with corresponding envs and `sglang_config`. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs

Update gb200 recipes #130 — Adds/edits deployment recipes and SGLANG-related environment keys similar to this change (container images, sglang flags).
Update GB200-FP4 1k/8k configs #103 — Modifies FP4 recipe files and sglang_config keys (context-length, disaggregation-transfer-backend, fp4 gemm backend) that align with entries in these new configs.

Suggested reviewers

ishandhanani
kyleliang-nv

Poem

🐰 New recipes bloom in YAML rows,
GPUs hum where the warm wind blows,
FP4, FP8 — parallel cheer,
Prefill, decode, benchmarks steer,
Hop, tune, deploy — the rabbit knows! 🥕

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the primary change: adding B200 FP8/FP4 STP configuration files. It is concise, clear, and directly reflects the changeset content. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

recipes/b200-fp4/8k1k/stp/low-latency-tp4-1p-tp8-1d.yaml (1)

26-59: Consider using YAML anchors to reduce environment duplication.

The prefill_environment and decode_environment blocks are nearly identical (the only addition in decode is SGLANG_DECODE_BOOTSTRAP_TIMEOUT). YAML anchors (& / <<: *) would eliminate the duplication and reduce the risk of the blocks drifting out of sync.
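
The effect the reviewer is after can be sketched in Python as a shared base mapping plus one decode-only key, which is roughly what a YAML anchor with a `<<: *` merge would produce once loaded. This is a minimal illustration, not the recipe loader: the variable names `DYN_REQUEST_PLANE` and `SGLANG_DECODE_BOOTSTRAP_TIMEOUT` appear in the review comments, but the timeout value and the `env_drift` helper are hypothetical.

```python
# Hypothetical values: only the variable names come from the review comments.
shared_env = {
    "DYN_REQUEST_PLANE": "nats",
}

# Rough equivalent of a YAML anchor merge: prefill reuses the base as-is,
# decode reuses the base and adds its one extra key.
prefill_environment = dict(shared_env)
decode_environment = {**shared_env, "SGLANG_DECODE_BOOTSTRAP_TIMEOUT": "1000"}


def env_drift(prefill, decode):
    """Return the sorted keys that are missing from one block or differ in value."""
    keys = set(prefill) | set(decode)
    return sorted(k for k in keys if prefill.get(k) != decode.get(k))


# With a shared base, the only difference is the intentional decode-only key.
print(env_drift(prefill_environment, decode_environment))
```

Run against two hand-maintained blocks, the same helper would list any key that has drifted, which is the risk the nitpick describes.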

recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml (1)

131-131: Commented-out config moe-dense-tp-size in decode block.

If this was intentionally removed for decode, consider deleting the line entirely rather than leaving it commented out—it reduces noise when comparing configs.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml`:
- Around line 64-70: The context-length (context-length: 2200) is too tight
given isl: 1024 + osl: 1024 = 2048 tokens and leaves only ~152 token headroom;
update the config to increase context-length (e.g., to 2560) to safely
accommodate special/system tokens or templating, or add a comment/validation in
the config-generation path ensuring the value is intentionally constrained;
refer to the context-length key and the isl/osl settings when making the change
so the new limit provides sufficient margin.

In `@recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml`:
- Around line 62-65: The context-length value (context-length: 2200) is too
small relative to max-prefill-tokens (32768) and will cap the model’s usable
sequence length; update the context-length in both the prefill and decode
sections to match or exceed the prefill capacity (e.g., 32768 or a larger
appropriate value for DeepSeek-R1) so the max-prefill-tokens limit is
attainable, or if 2200 is intentionally used for this benchmark add a clear
inline comment next to the context-length entries explaining the rationale.
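
The headroom arithmetic in the first comment above can be checked with a short sketch. The helper names are hypothetical, and the 512-token default margin is an assumption chosen only because it reproduces the reviewer's example value of 2560; it is not a recommendation from SGLang.

```python
def context_headroom(context_length, isl, osl):
    """Tokens left over after the benchmark's input and output lengths."""
    return context_length - (isl + osl)


def suggest_context_length(isl, osl, margin=512):
    """Smallest context length that leaves `margin` spare tokens (assumed margin)."""
    return isl + osl + margin


# Values quoted in the review comment: context-length 2200, isl 1024, osl 1024.
print(context_headroom(2200, 1024, 1024))
print(suggest_context_length(1024, 1024))
```

A check like this could run over every recipe to flag configs whose headroom is negative or smaller than the chosen margin.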

🧹 Nitpick comments (6)

recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml (1)

120-122: Remove or document the commented-out moe-dense-tp-size.

Line 122 has a commented-out # moe-dense-tp-size: 1. If it's intentionally omitted for decode, a brief comment explaining why would help; otherwise, clean it up to avoid confusion across the 16 new config files.

recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p1d.yaml (1)

16-48: Consider reducing environment block duplication across recipe files.

The prefill_environment and decode_environment blocks are identical across all 7 FP8 STP configs in this PR. If the recipe loader supports YAML anchors (&/*) or an include/template mechanism, extracting these into a shared base would significantly reduce maintenance overhead when an env var needs to change.

Not a blocker — just something to consider as the recipe count grows.

recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml (1)

17-51: Inconsistent value quoting and boolean representations across environment variables.

Minor style nit: most values are quoted strings, but DYN_REQUEST_PLANE: nats (lines 33, 51) is unquoted. Additionally, boolean-like values mix representations: "True" (line 26/44), "false" (line 22/39), "1" / "0" (lines 27–32, etc.). While functionally fine for YAML parsing, inconsistent casing/quoting can cause subtle bugs if the consuming application is case-sensitive (e.g., "True" vs "true" in Python env var checks).
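
To make the case-sensitivity concern concrete, here is a small Python sketch of how a consumer might read boolean-like environment values. `parse_env_bool` and its accepted spellings are hypothetical; this does not describe how SGLang or Dynamo actually parse their variables.

```python
TRUE_VALUES = {"1", "true", "yes", "on"}
FALSE_VALUES = {"0", "false", "no", "off"}


def parse_env_bool(raw):
    """Normalise a boolean-like string, rejecting anything unrecognised."""
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean-like value: {raw!r}")


# A naive case-sensitive comparison treats "True" as not enabled.
naive = "True" == "true"
# A normalising parser accepts every spelling used in the recipe.
parsed = [parse_env_bool(v) for v in ("True", "false", "1", "0")]
print(naive, parsed)
```

If the consuming code uses a comparison like the naive one, sticking to a single spelling in the recipes avoids the mismatch entirely.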

recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml (2)

31-31: Consider quoting the NATS value for consistency.

The DYN_REQUEST_PLANE: nats entries are unquoted, while other string values in the environment sections are quoted. While YAML permits unquoted strings, quoting ensures consistent parsing and avoids potential issues with special values.

✨ Suggested change for consistency

-    DYN_REQUEST_PLANE: nats
+    DYN_REQUEST_PLANE: "nats"

Apply this in both prefill_environment (line 31) and decode_environment (line 48).

Also applies to: 48-48

80-80: Clarify the purpose of commented moe-dense-tp-size.

The # moe-dense-tp-size: 1 parameter is commented out in both prefill and decode sections. If this parameter is not needed for the current configuration, consider removing it to reduce clutter. If it might be needed for future tuning, add a brief comment explaining when it should be enabled.

Also applies to: 117-117

recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-5d.yaml (1)

122-122: Commented-out moe-dense-tp-size in decode config.

This line is commented out in both low-latency decode configs but active in the max-throughput decode configs. If the intent is to rely on a default value, consider removing the comment entirely to avoid confusion. If it's a deliberate tuning choice for low-latency, a brief note explaining why would be helpful.
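
Since the commented-out `moe-dense-tp-size` comes up in several of the nitpicks above, a quick scan can list every such line in a recipe. The sketch below is a hypothetical helper using a plain regular expression rather than a YAML parser, and the sample text (including the `tp-size` key) is an illustrative layout, not copied from the PR.

```python
import re

# Matches lines such as "  # some-key: value" (a key that was commented out).
COMMENTED_KEY = re.compile(r"^\s*#\s*([A-Za-z0-9_-]+):\s*\S")


def commented_out_keys(yaml_text):
    """Return (line_number, key) pairs for lines that look like disabled keys."""
    hits = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        match = COMMENTED_KEY.match(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits


# Illustrative sample, not taken from the actual recipe files.
sample = """\
decode:
  tp-size: 8
  # moe-dense-tp-size: 1
"""
print(commented_out_keys(sample))
```

The pattern also matches ordinary comments shaped like `# note: text`, so the results need a manual look before anything is deleted.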

elvischenv added 3 commits February 6, 2026 17:38

b200 fp8 stp init config

658d9b9

b200 fp4 stp init config

4c6ccaf

update container name

c5eb751

coderabbitai bot reviewed Feb 7, 2026

Comment thread recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml

Comment thread recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml

Add B200-FP4-8k1k-STP configs

0642b78

elvischenv closed this Mar 5, 2026

elvischenv wants to merge 4 commits into ishandhanani:main from elvischenv:elvischenv/b200-config
