
Add B200 FP8/FP4 STP configs#162

Closed
elvischenv wants to merge 4 commits into ishandhanani:main from elvischenv:elvischenv/b200-config

Conversation


@elvischenv elvischenv commented Feb 7, 2026

Summary by CodeRabbit

  • New Features
    • Added a broad set of deployment configuration templates for FP4 and FP8 inference, covering low‑latency and max‑throughput scenarios.
    • Templates include varied prefill/decode topologies (single/multi‑node, multi‑GPU), resource sizing, health checks, and benchmark presets for performance testing.


coderabbitai bot commented Feb 7, 2026

📝 Walkthrough


Adds multiple new YAML deployment recipes for B200 GPU inference using SGLang, covering FP4 and FP8 precisions, two sequence-length profiles (1k1k, 8k1k), and two deployment strategies (low-latency, max-throughput), each with detailed resource, environment, sglang_config, health_check, and benchmark settings.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| FP4 1k1k Low-Latency | `recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-5d.yaml`, `recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml` | New low-latency FP4 deployment recipes: model, resources, prefill/decode environments, detailed sglang_config (prefill/decode), health_check, and sa-bench benchmark settings. |
| FP4 1k1k Max-Throughput | `recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-1d.yaml`, `recipes/b200-fp4/1k1k/stp/max-tpt-dep4-1p-dep8-2d.yaml` | New max-throughput FP4 recipes with tuned parallelism, environment vars (NCCL/MNNVL/SGLANG), sglang_config for prefill/decode, health_check, and benchmark sections. |
| FP4 8k1k Low-Latency | `recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml`, `recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-5d.yaml`, `recipes/b200-fp4/8k1k/stp/low-latency-dep4-2p-tep8-5d.yaml` | Adds 8k1k FP4 low-latency recipes with Dynamo/frontend flags, extensive prefill/decode envs, MoE and FP4-specific backend flags, and benchmark configs. |
| FP4 8k1k Max-Throughput | `recipes/b200-fp4/8k1k/stp/max-tpt-dep4-4p-dep8-1d.yaml`, `recipes/b200-fp4/8k1k/stp/max-tpt-dep4-7p-dep8-2d.yaml` | New high-parallelism FP4 max-throughput recipes covering resource allocations, env tuning, and sglang_config for prefill/decode. |
| FP8 1k1k Low-Latency | `recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p1d.yaml`, `recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml` | New FP8 low-latency recipes with prefill/decode envs, attention backend and kv-cache dtype settings, sglang_config, health_check, and benchmarks. |
| FP8 1k1k Max-Throughput | `recipes/b200-fp8/1k1k/stp/max-tpt-dep8-1p5d.yaml`, `recipes/b200-fp8/1k1k/stp/max-tpt-dep8-2p5d.yaml` | Adds FP8 max-throughput configs with multi-node prefill/decode resource layouts, memory pool/env tuning, and sglang_config parallelism/MoE settings. |
| FP8 8k1k Low-Latency | `recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p1d.yaml`, `recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p4d.yaml`, `recipes/b200-fp8/8k1k/stp/low-latency-tep8-1p6d.yaml` | New FP8 8k1k low-latency recipes with disaggregation prefill/decode configs, trtllm_mla attention settings, MoE flags, and benchmarks. |
| FP8 8k1k Max-Throughput | `recipes/b200-fp8/8k1k/stp/max-tpt-dep8-1p1d.yaml`, `recipes/b200-fp8/8k1k/stp/max-tpt-dep8-2p1d.yaml` | Adds FP8 max-throughput recipes with env/parallelism tuning, sglang_config for both stages, health_check, and benchmark entries. |
| FP4 TP4/TP8 Mixed | `recipes/b200-fp4/8k1k/stp/low-latency-tp4-1p-tp8-1d.yaml` | New mixed TP4 (prefill) / TP8 (decode) FP4 low-latency recipe with corresponding envs and sglang_config. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Possibly related PRs

  • Update gb200 recipes #130 — Adds/edits deployment recipes and SGLANG-related environment keys similar to this change (container images, sglang flags).
  • Update GB200-FP4 1k/8k configs #103 — Modifies FP4 recipe files and sglang_config keys (context-length, disaggregation-transfer-backend, fp4 gemm backend) that align with entries in these new configs.

Suggested reviewers

  • ishandhanani
  • kyleliang-nv

Poem

🐰 New recipes bloom in YAML rows,
GPUs hum where the warm wind blows,
FP4, FP8 — parallel cheer,
Prefill, decode, benchmarks steer,
Hop, tune, deploy — the rabbit knows! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the primary change: adding B200 FP8/FP4 STP configuration files. It is concise, clear, and directly reflects the changeset content. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
recipes/b200-fp4/8k1k/stp/low-latency-tp4-1p-tp8-1d.yaml (1)

26-59: Consider using YAML anchors to reduce environment duplication.

The prefill_environment and decode_environment blocks are nearly identical (the only addition in decode is SGLANG_DECODE_BOOTSTRAP_TIMEOUT). YAML anchors (& / <<: *) would eliminate the duplication and reduce the risk of the blocks drifting out of sync.
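
A minimal sketch of the anchor approach, assuming the recipe loader resolves YAML anchors and the `<<` merge key (PyYAML does; strict YAML 1.2 loaders may not). `DYN_REQUEST_PLANE` and `SGLANG_DECODE_BOOTSTRAP_TIMEOUT` appear in this PR; the timeout value shown is a placeholder, not taken from the recipe.

```yaml
# Shared variables are defined once on the prefill block and anchored.
prefill_environment: &shared_env
  DYN_REQUEST_PLANE: "nats"
  # ... remaining shared variables from the recipe ...

# Decode merges the shared block and adds its one extra variable.
decode_environment:
  <<: *shared_env
  SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"  # placeholder value, an assumption
```

Keys set directly under `decode_environment` take precedence over merged ones, so a decode-only override stays local to that block.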

recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml (1)

131-131: Commented-out config moe-dense-tp-size in decode block.

If this was intentionally removed for decode, consider deleting the line entirely rather than leaving it commented out—it reduces noise when comparing configs.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml`:
- Around line 64-70: The context-length (context-length: 2200) is too tight
given isl: 1024 + osl: 1024 = 2048 tokens and leaves only ~152 token headroom;
update the config to increase context-length (e.g., to 2560) to safely
accommodate special/system tokens or templating, or add a comment/validation in
the config-generation path ensuring the value is intentionally constrained;
refer to the context-length key and the isl/osl settings when making the change
so the new limit provides sufficient margin.

In `@recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml`:
- Around line 62-65: The context-length value (context-length: 2200) is too
small relative to max-prefill-tokens (32768) and will cap the model’s usable
sequence length; update the context-length in both the prefill and decode
sections to match or exceed the prefill capacity (e.g., 32768 or a larger
appropriate value for DeepSeek-R1) so the max-prefill-tokens limit is
attainable, or if 2200 is intentionally used for this benchmark add a clear
inline comment next to the context-length entries explaining the rationale.
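
For the first item, the headroom arithmetic can be recorded next to the value it justifies. This is a sketch only, assuming `isl`/`osl` sit under a `benchmark` block and `context-length` under `sglang_config.prefill`; the exact nesting in the recipe may differ, and 2560 is just the example value suggested above.

```yaml
benchmark:
  isl: 1024   # input sequence length in tokens
  osl: 1024   # output sequence length in tokens

sglang_config:
  prefill:
    # isl + osl = 1024 + 1024 = 2048 tokens per request.
    # context-length 2200 -> 2200 - 2048 = 152 tokens of headroom.
    # context-length 2560 -> 2560 - 2048 = 512 tokens of headroom.
    context-length: 2560
```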
🧹 Nitpick comments (6)
recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml (1)

120-122: Remove or document the commented-out moe-dense-tp-size.

Line 122 has a commented-out # moe-dense-tp-size: 1. If it's intentionally omitted for decode, a brief comment explaining why would help; otherwise, clean it up to avoid confusion across the 16 new config files.

recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p1d.yaml (1)

16-48: Consider reducing environment block duplication across recipe files.

The prefill_environment and decode_environment blocks are identical across all 7 FP8 STP configs in this PR. If the recipe loader supports YAML anchors (&/*) or an include/template mechanism, extracting these into a shared base would significantly reduce maintenance overhead when an env var needs to change.

Not a blocker — just something to consider as the recipe count grows.

recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml (1)

17-51: Inconsistent value quoting and boolean representations across environment variables.

Minor style nit: most values are quoted strings, but DYN_REQUEST_PLANE: nats (lines 33, 51) is unquoted. Additionally, boolean-like values mix representations: "True" (line 26/44), "false" (line 22/39), "1" / "0" (lines 27–32, etc.). While functionally fine for YAML parsing, inconsistent casing/quoting can cause subtle bugs if the consuming application is case-sensitive (e.g., "True" vs "true" in Python env var checks).
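
To illustrate how quoting changes what a loader hands to the application, here is a small sketch. The variable names prefixed `EXAMPLE_` are hypothetical and not taken from the recipe; the typing comments assume a loader that follows the usual YAML core rules (for example PyYAML).

```yaml
prefill_environment:
  DYN_REQUEST_PLANE: nats     # unquoted, still loads as the string "nats"
  EXAMPLE_FLAG_A: "True"      # string "True"
  EXAMPLE_FLAG_B: "false"     # string "false"
  EXAMPLE_FLAG_C: "1"         # string "1"
  EXAMPLE_FLAG_D: true        # unquoted: loads as a boolean, not a string
  EXAMPLE_FLAG_E: 1           # unquoted: loads as an integer, not a string
```

A consumer that compares an environment value against the literal `"true"` would not match `"True"`, which is the case-sensitivity risk described above; picking one spelling and quoting every value avoids both problems.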

recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml (2)

31-31: Consider quoting the NATS value for consistency.

The DYN_REQUEST_PLANE: nats entries are unquoted, while other string values in the environment sections are quoted. While YAML permits unquoted strings, quoting ensures consistent parsing and avoids potential issues with special values.

✨ Suggested change for consistency
-    DYN_REQUEST_PLANE: nats
+    DYN_REQUEST_PLANE: "nats"

Apply this in both prefill_environment (line 31) and decode_environment (line 48).

Also applies to: 48-48


80-80: Clarify the purpose of commented moe-dense-tp-size.

The # moe-dense-tp-size: 1 parameter is commented out in both prefill and decode sections. If this parameter is not needed for the current configuration, consider removing it to reduce clutter. If it might be needed for future tuning, add a brief comment explaining when it should be enabled.

Also applies to: 117-117

recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-5d.yaml (1)

122-122: Commented-out moe-dense-tp-size in decode config.

This line is commented out in both low-latency decode configs but active in the max-throughput decode configs. If the intent is to rely on a default value, consider removing the comment entirely to avoid confusion. If it's a deliberate tuning choice for low-latency, a brief note explaining why would be helpful.

@elvischenv elvischenv closed this Mar 5, 2026