Add GB200 FP8 1k/8k configs #115

Open
kyleliang-nv wants to merge 3 commits into main from kylliang/gb200-fp8-1k8k

kyleliang-nv (Collaborator) commented Jan 28, 2026

Summary by CodeRabbit

  • New Features
    • Added three deployment profiles for gb200-fp8 (low-latency, maximum-throughput, mid-curve) to choose optimized inference behavior.
    • Each profile includes tuned runtime and memory/performance controls, separate prefill vs decode optimizations, and built-in benchmarking settings to evaluate latency and throughput.

coderabbitai bot (Contributor) commented Jan 28, 2026

Walkthrough

Adds three new deployment YAMLs for gb200-fp8 1k/8k: low-latency, max-tpt, and mid-curve. Each file configures Dynamo frontend topology, model/container/precision, resource allocations, and detailed SGLang backend settings for separate prefill and decode modes plus benchmark parameters.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **GB200-FP8 1K/8K Configuration Files**<br>`recipes/gb200-fp8/1k8k/low-latency.yaml`, `recipes/gb200-fp8/1k8k/max-tpt.yaml`, `recipes/gb200-fp8/1k8k/mid-curve.yaml` | Added three complete deployment configs. Each defines Dynamo frontend (multi-frontend support), model metadata (path, container, precision), resource allocations (gpu_type, prefill/decode nodes/workers, gpus_per_node), and extensive SGLang backend configs split for prefill vs decode (served-model-name, trust-remote-code, kv-cache-dtype, attention/backend, quantization, moa/runner, disaggregation-mode, tensor/data/expert parallelism, mem-fraction, CUDA-graph and DeepEP/MOE tuning). Includes benchmark blocks (concurrency, req_rate, sa-bench) and numerous performance/timeout/cache settings requiring validation across prefill/decode and resource params. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • ishandhanani

Poem

🐇 A rabbit hops through YAML fields bright,
Low-latency, max-tpt, mid-curve take flight,
Prefill and decode in careful array,
SGLang whispers tweaks through night and day,
Hooray for configs tuned just right! 🥕✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding three new YAML configuration files (low-latency.yaml, max-tpt.yaml, mid-curve.yaml) for GB200 FP8 1k/8k deployment specifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k8k/low-latency.yaml`:
- Line 9: The inline comment for the YAML key num_additional_frontends is
truncated; update the comment to a complete sentence clarifying the meaning
(e.g., complete the fragment "# Additional routers (total = 1 + t" to something
like "# Additional routers (total = 1 + num_additional_frontends)" or a
similarly clear description) so anyone reading the key understands how the total
router count is computed; locate the num_additional_frontends entry and replace
the truncated comment with the full explanatory text.
- Line 29: Update the SGLANG_DG_CACHE_DIR value to use the absolute path to
match the other recipes and avoid working-directory dependent behavior: locate
the SGLANG_DG_CACHE_DIR entry in this file
(recipes/gb200-fp8/1k8k/low-latency.yaml) and change the value from
"configs/dg-0.5.5.post2" to "/configs/dg-0.5.5.post2" so it is consistent with
mid-curve.yaml and max-tpt.yaml.
- Line 47: Update the SGLANG_DG_CACHE_DIR value in this file to use the same
corrected path used in decode_environment across the other config files; locate
the SGLANG_DG_CACHE_DIR entry and replace its current path with the exact
canonical path string used elsewhere so the decode_environment lookup is
consistent with the other configurations.

In `@recipes/gb200-fp8/1k8k/max-tpt.yaml`:
- Line 12: The inline comment for num_additional_frontends is truncated; update
the comment to a complete explanatory sentence such as "Additional routers
(total = 1 + num_additional_frontends)" or "Additional routers (total routers =
1 + num_additional_frontends)" next to the num_additional_frontends key so the
intent is clear.

In `@recipes/gb200-fp8/1k8k/mid-curve.yaml`:
- Line 12: The inline comment for the YAML key num_additional_frontends is
truncated; update the comment for num_additional_frontends to complete the
explanatory text (e.g., "Additional routers (total = 1 +
num_additional_frontends)") so it clearly states how the total routers is
computed and what the value represents.
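
The three truncated-comment findings above share one pattern: an inline comment that opens a parenthesis and never closes it. A minimal sketch of a check for that pattern follows. The function name `find_truncated_comments`, the `frontend:` parent key, and the value `0` are illustrative assumptions and are not taken from the recipes; the comment split is deliberately naive and does not handle `#` inside quoted strings.

```python
def find_truncated_comments(yaml_text):
    """Return (line_number, comment) pairs whose comment has unbalanced parentheses."""
    flagged = []
    for line_no, line in enumerate(yaml_text.splitlines(), start=1):
        # Naive split: treat the first '#' as the start of the comment.
        # This ignores '#' characters that appear inside quoted strings.
        _, sep, comment = line.partition("#")
        if not sep:
            continue  # no comment on this line
        if comment.count("(") != comment.count(")"):
            flagged.append((line_no, comment.strip()))
    return flagged


# Hypothetical fragment: only the key name and the truncated comment text
# come from the review; the parent key and the value are placeholders.
sample = "\n".join([
    "frontend:",
    "  num_additional_frontends: 0  # Additional routers (total = 1 + t",
])

print(find_truncated_comments(sample))
```

Run against each of the three recipe files, a check like this would likely surface the same lines the review flags, though a balanced-parenthesis test will miss comments that are cut off outside parentheses.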

Comment thread recipes/gb200-fp8/1k8k/low-latency.yaml Outdated
Comment thread recipes/gb200-fp8/1k8k/low-latency.yaml Outdated
Comment thread recipes/gb200-fp8/1k8k/low-latency.yaml Outdated
Comment thread recipes/gb200-fp8/1k8k/max-tpt.yaml Outdated
Comment thread recipes/gb200-fp8/1k8k/mid-curve.yaml Outdated
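
The `SGLANG_DG_CACHE_DIR` findings on low-latency.yaml could be caught mechanically by flagging any value that is not an absolute path. The sketch below assumes the variable appears as a plain `KEY: value` line; the helper name `find_relative_cache_dirs` and the `prefill_environment` section name are assumptions for illustration (only `decode_environment` is named in the review).

```python
import re

CACHE_KEY = "SGLANG_DG_CACHE_DIR"


def find_relative_cache_dirs(yaml_text, key=CACHE_KEY):
    """Return (line_number, value) pairs where `key` is set to a non-absolute path."""
    # Matches "KEY: value" with an optional trailing "# comment".
    pattern = re.compile(r"^\s*" + re.escape(key) + r"\s*:\s*(.+?)\s*(?:#.*)?$")
    flagged = []
    for line_no, line in enumerate(yaml_text.splitlines(), start=1):
        match = pattern.match(line)
        if not match:
            continue
        value = match.group(1).strip("\"'")  # drop surrounding quotes, if any
        if not value.startswith("/"):
            flagged.append((line_no, value))
    return flagged


# Hypothetical fragment: the two path strings come from the review;
# the section layout is an assumption.
sample = "\n".join([
    "prefill_environment:",
    '  SGLANG_DG_CACHE_DIR: "configs/dg-0.5.5.post2"',
    "decode_environment:",
    '  SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.5.post2"',
])

print(find_relative_cache_dirs(sample))
```

Checking for a leading `/` rather than comparing against one hard-coded path keeps the sketch independent of the cache version string, at the cost of not detecting two different absolute paths across recipes.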