Conversation
📝 WalkthroughWalkthroughAdds multiple new YAML deployment recipes for B200 GPU inference using SgLang, covering FP4 and FP8 precisions, various sequence lengths (1k1k, 8k1k), and deployment strategies (low-latency, max-throughput) with detailed resource, environment, sglang_config, health_check, and benchmark settings. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml`:
- Around line 64-70: The context-length (context-length: 2200) is too tight
given isl: 1024 + osl: 1024 = 2048 tokens and leaves only ~152 token headroom;
update the config to increase context-length (e.g., to 2560) to safely
accommodate special/system tokens or templating, or add a comment/validation in
the config-generation path ensuring the value is intentionally constrained;
refer to the context-length key and the isl/osl settings when making the change
so the new limit provides sufficient margin.
In `@recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml`:
- Around line 62-65: The context-length value (context-length: 2200) is too
small relative to max-prefill-tokens (32768) and will cap the model’s usable
sequence length; update the context-length in both the prefill and decode
sections to match or exceed the prefill capacity (e.g., 32768 or a larger
appropriate value for DeepSeek-R1) so the max-prefill-tokens limit is
attainable, or if 2200 is intentionally used for this benchmark add a clear
inline comment next to the context-length entries explaining the rationale.
🧹 Nitpick comments (6)
recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-1d.yaml (1)
120-122: Remove or document the commented-outmoe-dense-tp-size.Line 122 has a commented-out
# moe-dense-tp-size: 1. If it's intentionally omitted for decode, a brief comment explaining why would help; otherwise, clean it up to avoid confusion across the 16 new config files.recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p1d.yaml (1)
16-48: Consider reducing environment block duplication across recipe files.The
prefill_environmentanddecode_environmentblocks are identical across all 7 FP8 STP configs in this PR. If the recipe loader supports YAML anchors (&/*) or an include/template mechanism, extracting these into a shared base would significantly reduce maintenance overhead when an env var needs to change.Not a blocker — just something to consider as the recipe count grows.
recipes/b200-fp4/1k1k/stp/low-latency-dep4-1p-tep8-6d.yaml (1)
17-51: Inconsistent value quoting and boolean representations across environment variables.Minor style nit: most values are quoted strings, but
DYN_REQUEST_PLANE: nats(lines 33, 51) is unquoted. Additionally, boolean-like values mix representations:"True"(line 26/44),"false"(line 22/39),"1"/"0"(lines 27–32, etc.). While functionally fine for YAML parsing, inconsistent casing/quoting can cause subtle bugs if the consuming application is case-sensitive (e.g.,"True"vs"true"in Python env var checks).recipes/b200-fp8/1k1k/stp/low-latency-tep8-1p3d.yaml (2)
31-31: Consider quoting the NATS value for consistency.The
DYN_REQUEST_PLANE: natsentries are unquoted, while other string values in the environment sections are quoted. While YAML permits unquoted strings, quoting ensures consistent parsing and avoids potential issues with special values.✨ Suggested change for consistency
- DYN_REQUEST_PLANE: nats + DYN_REQUEST_PLANE: "nats"Apply this in both prefill_environment (line 31) and decode_environment (line 48).
Also applies to: 48-48
80-80: Clarify the purpose of commented moe-dense-tp-size.The
# moe-dense-tp-size: 1parameter is commented out in both prefill and decode sections. If this parameter is not needed for the current configuration, consider removing it to reduce clutter. If it might be needed for future tuning, add a brief comment explaining when it should be enabled.Also applies to: 117-117
recipes/b200-fp4/8k1k/stp/low-latency-dep4-1p-tep8-5d.yaml (1)
122-122: Commented-outmoe-dense-tp-sizein decode config.This line is commented out in both low-latency decode configs but active in the max-throughput decode configs. If the intent is to rely on a default value, consider removing the comment entirely to avoid confusion. If it's a deliberate tuning choice for low-latency, a brief note explaining why would be helpful.
Summary by CodeRabbit