📝 **Walkthrough**

Six new YAML deployment profiles were added for FP8 inference on GB300 hardware, covering low-latency, mid-, and max-throughput targets for two model sizes (1k1k and 8k1k). Each file encodes resources, environment hooks, `sglang_config`, and benchmarking parameters.
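For orientation, a minimal sketch of the shape these recipes appear to share; every key and value below is an assumption inferred from this walkthrough, not copied from the PR diff:

```yaml
# Hypothetical skeleton of one GB300 FP8 recipe. Field names beyond the
# ones named in the walkthrough are guesses; values are illustrative only.
name: gb300-1k1k-fp8-max        # configuration name cited in the review below
resources:
  gpu_type: GB300               # field name assumed; the review mentions gpu_type
env:                            # environment hooks (flag name illustrative)
  SGLANG_SOME_FLAG: "1"
sglang_config:                  # engine flags, e.g. parallelism sizes
  tp-size: 4                    # illustrative value only
benchmark:                      # workload parameters
  isl: 1024                     # 1k input tokens for a 1k1k profile (assumed)
  osl: 1024                     # 1k output tokens (assumed)
```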
**Sequence Diagram(s)**

(Skipped — changes are configuration files without new multi-component control flow requiring sequence visualization.)

**Estimated code review effort**

🎯 3 (Moderate) | ⏱️ ~20 minutes
**🚥 Pre-merge checks | ✅ 2 | ❌ 1**

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@recipies/gb300-fp8/1k1k/stp/max.yaml`:
- Around line 1-3: The comment header at the top of the recipe incorrectly reads
"GB200"; update that comment to reference "GB300" so it matches the
configuration name "gb300-1k1k-fp8-max" (keep the rest of the header text
unchanged) — locate the top-of-file comment and change "GB200" to "GB300".
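A minimal sketch of the intended one-word fix, assuming the header comment sits directly above the name field; the same change applies to the other GB200-labelled headers flagged below:

```yaml
# GB300 FP8 Max Throughput Configuration   <- corrected: previously said "GB200"
name: gb300-1k1k-fp8-max                   # header now matches this name
```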
In `@recipies/gb300-fp8/1k1k/stp/mid.yaml`:
- Around line 1-2: Update the top comment to match the actual config: replace
the incorrect header "GB200 FP8 Max Throughput Configuration" with a correct
description reflecting GB300 and mid-tier throughput (for example "GB300 FP8 Mid
Throughput Configuration"), ensuring it aligns with the config name field name:
"gb300-1k1k-fp8-mid".
In `@recipies/gb300-fp8/8k1k/stp/low-latency.yaml`:
- Around line 118-123: The benchmark block sets isl to 8102 (benchmark: isl: 8102), which is likely a typo, since other 8k1k configs use 8192. Verify the intended target and, if it is a mistake, update the isl value to 8192 in the same benchmark section (ensuring req_rate, osl, and concurrencies remain unchanged), then run any config validation or tests to confirm the change. If 8102 was intentional, add an inline comment explaining why it differs from the 8192 baseline.
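If 8102 is indeed a typo, the corrected block would look like this sketch; the sibling keys are shown only as a comment because their values do not appear in this review:

```yaml
benchmark:
  isl: 8192   # corrected from 8102 to match the 8192 baseline of the other 8k1k configs
  # osl, req_rate, and concurrencies stay exactly as they are in the file
```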
In `@recipies/gb300-fp8/8k1k/stp/max.yaml`:
- Around line 1-3: The top-line comment is inconsistent with the configuration name/gpu_type: update the header comment "# GB200 FP8 Max Throughput Configuration" to say "GB300" so that it matches the configuration name "gb300-8k1k-fp8-max" (or, if GB200 was actually intended, rename the configuration name/gpu_type instead); ensure the header text, the name "gb300-8k1k-fp8-max", and any gpu_type field are all consistent.
In `@recipies/gb300-fp8/8k1k/stp/mid.yaml`:
- Around line 1-3: The comment header is incorrect: replace "GB200 FP8 Max
Throughput Configuration" with a header that matches this recipe (GB300,
mid-throughput). Update the top-of-file comment to something like "GB300 FP8 Mid
Throughput Configuration" so it aligns with the name field "gb300-8k1k-fp8-mid"
and the file's intended throughput tier.
🧹 Nitpick comments (1)
recipies/gb300-fp8/8k1k/stp/low-latency.yaml (1)
61-116: **Configuration uses different parameter naming style.** This file uses `tensor-parallel-size`, `data-parallel-size`, and `expert-parallel-size`, while other configurations in this PR use the shorter `tp-size`, `dp-size`, `ep-size` style. Both likely work, but consider standardizing for maintainability across configurations.
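Side by side, the two styles the nitpick refers to; only the key names come from the review, the values are hypothetical:

```yaml
# Long-form keys used in 8k1k/stp/low-latency.yaml (values illustrative):
sglang_config:
  tensor-parallel-size: 4
  data-parallel-size: 8
  expert-parallel-size: 4
---
# Equivalent short-form keys used by the other recipes in this PR:
sglang_config:
  tp-size: 4
  dp-size: 8
  ep-size: 4
```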
configs for gb300-fp8-no-mtp
* Merge pull request #118 from ishandhanani/grho/Jan29_a: configs for gb300-fp8-no-mtp
* Update SGL-GB200-FP4-1k8k configs to use dynamo-sglang container and specify nginx container
* Add GB200-FP8-1k8k
* Update GB200 FP8 1k8k recipes
* typo
* only build for 9.0
* go
* go
* again
* try again
* go
* Update gb200 recipes (#130)
  * Update GB200-FP8 configs
  * Update GB200-FP4 configs
  * Add nginx container to all GB200-FP8 configs
  * Add nginx container to GB200-FP4 configs
  * Cleanup configs
  * Switch to use fast DG cache compile
  * fix container
  * clean up old
* Add 1k1k STP and MTP disagg H100 configs (#140)
  * Add 1k1k STP and MTP disagg H100 configs
  * Update H100 FP8 configs with verified 29 Pareto-optimal points. Replace previous configs with verified Pareto-optimal configurations:
    - 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
    - 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
    - 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
    - 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)

    Standardize container to nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1
  * Update H100 configs to tensorrtllm-runtime:0.8.1.post3. Update all 29 H100 FP8 config files to use the new container:
    - nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
* updates the recipe for Dynamo-SGLang B200 submissions
* adds modified B200-fp8 recipes
* updates the recipes
* prune the concurrency
* Add B200 MTP FP4 SGLANG recipes
* Update model path and container

  Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
* modify b200 sgl fp4 non-mtp configs (#168)
  * adds conc=128 point
  * adds 1p2d config
* modify job name to support multiple gh runners (#182)
* Add resolved B200 FP8 8k1k recipe variants for CI compatibility

  14 standalone recipe files resolved from the consolidated 8k1k.yaml (main branch) for use with the sa-submission-q1-2026 srtctl, which does not support zip_override syntax.

  STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
  MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)

  Made-with: Cursor
* Bump MTP 8k1k health check timeout from 60min to 120min

  EAGLE speculative decoding + cuda-graph-max-bs=1024 requires ~50min of CUDA graph capture alone on the decode worker. Combined with model loading, DeepGEMM JIT warmup, and FlashInfer autotune, the total init exceeds the 60min (360 attempts x 10s) health check window on cold nodes. Increase max_attempts from 360 to 720 (7200s = 120min) on all 7 MTP recipe variants to provide sufficient headroom.

  Made-with: Cursor
* Fix cuda-graph-max-bs on MTP maxtpt decode workers

  With data-parallel-size=8 and dp-attention, the scheduler distributes requests across 8 DP replicas. Each replica only sees max-running-requests/dp concurrent sequences, so cuda-graph-max-bs should be divided by dp accordingly. Previous values caused CUDA graph capture of 99 batch sizes per DP replica with EAGLE speculative decoding, taking 80+ minutes and exceeding the health check timeout. Corrected values capture only 35 batch sizes, finishing in ~1 minute with no performance regression. Validated: MTP 3P1D output throughput 15,124 tok/s matches reference 14,995 tok/s (+0.9%).

  maxtpt_0: 128 -> 16 (max-running=128, dp=8)
  maxtpt_1: 256 -> 32 (max-running=256, dp=8)
  maxtpt_2: 512 -> 64 (max-running=512, dp=8)
  maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)

  Made-with: Cursor
* fix rebase

---------

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>
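The last two commits above both reduce to simple arithmetic. A hedged sketch, assuming the flag spellings used in the commit messages (the health_check block name is a guess; only max_attempts is quoted above):

```yaml
# Per-DP-replica CUDA graph sizing under dp-attention:
#   cuda-graph-max-bs = max-running-requests / data-parallel-size
sglang_config:
  data-parallel-size: 8
  max-running-requests: 1024
  cuda-graph-max-bs: 128        # 1024 / 8, the maxtpt_3 correction above

# Health-check window: 720 attempts x 10 s = 7200 s = 120 min
health_check:                   # block name assumed, not confirmed by the commit text
  max_attempts: 720
```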
Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.