
configs for gb300-fp8-no-mtp#118

Merged
gracehonv merged 2 commits into main from grho/Jan29_a on Jan 29, 2026

Conversation

gracehonv (Collaborator) commented Jan 29, 2026

Summary by CodeRabbit

  • New Features

    • Added GB300 FP8 deployment profiles for low-latency, mid-throughput, and max-throughput modes covering 1k/8k sequence variants.
    • Included performance/benchmark presets and CUDA-graph tuning options for prefill and decode phases.
  • Chores

    • Added comprehensive environment and resource configuration sets to support FP8 inference on GB300 hardware.


coderabbitai bot (Contributor) commented Jan 29, 2026

📝 Walkthrough

Six new YAML deployment profiles were added for FP8 inference on GB300 hardware, covering low-latency, mid-, and max-throughput targets for two sequence-length variants (1k1k and 8k1k). Each file encodes resources, environment hooks, sglang_config, and benchmarking parameters.

Changes

Cohort / File(s): Summary

  • 1k1k FP8 Deployment Profiles (recipies/gb300-fp8/1k1k/stp/low-latency.yaml, recipies/gb300-fp8/1k1k/stp/max.yaml, recipies/gb300-fp8/1k1k/stp/mid.yaml):
    Three new configs for 1k1k FP8: resource allocation (prefill/decode nodes, workers, GPUs), backend env hooks, detailed sglang_config for prefill/decode (model, TP/DP/EP, kv-cache dtype, attention backend, disaggregation, DeepEP/MoE), plus CUDA-graph and benchmark settings.
  • 8k1k FP8 Deployment Profiles (recipies/gb300-fp8/8k1k/stp/low-latency.yaml, recipies/gb300-fp8/8k1k/stp/max.yaml, recipies/gb300-fp8/8k1k/stp/mid.yaml):
    Three new configs for 8k1k FP8: same structure as the 1k1k set but tuned for the longer 8k input sequence (resource counts, parallelism, memory/disaggregation, DeepEP/MoE options), plus CUDA-graph and benchmark parameters.
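For orientation, a profile of this shape might look roughly like the following. This is a hypothetical sketch only: every field name and value here is inferred from the walkthrough's description (resources, env hooks, sglang_config, benchmark), not copied from the actual files.

```yaml
# Hypothetical sketch of one profile; names and values are illustrative, not the real file
name: gb300-1k1k-fp8-mid
resources:
  prefill:
    nodes: 1
    gpus_per_node: 4
  decode:
    nodes: 2
    gpus_per_node: 4
environment:
  backend_hooks: []          # env hooks mentioned in the walkthrough
sglang_config:
  prefill:
    tp-size: 4
    dp-size: 1
    kv-cache-dtype: fp8_e4m3
    disaggregation-mode: prefill
  decode:
    tp-size: 4
    dp-size: 8
    disaggregation-mode: decode
    cuda-graph-max-bs: 128
benchmark:
  isl: 1024
  osl: 1024
  concurrencies: [64, 128, 256]
```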

Sequence Diagram(s)

(Skipped — changes are configuration files without new multi-component control flow requiring sequence visualization.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • ishandhanani

Poem

🐰 I hopped through YAML rows and leaves,
Tucked FP8 dreams beneath GB300 eaves.
Low-latency whispers, throughput sings—
Six new paths for model wings. 🥕✨

🚥 Pre-merge checks | ✅ 2 passed | ❓ 1 inconclusive

❓ Inconclusive checks (1)
  • Title check: The PR title "configs for gb300-fp8-no-mtp" is vague and generic; it does not clearly convey the specific changes made. Resolution: use a more descriptive title, such as "Add GB300 FP8 configuration files for low-latency, mid, and max throughput profiles".

✅ Passed checks (2)
  • Description check: Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring coverage: No functions found in the changed files, so the docstring coverage check was skipped.




gracehonv requested a review from trevor-m on January 29, 2026 at 20:06

coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@recipies/gb300-fp8/1k1k/stp/max.yaml`:
- Around line 1-3: The comment header at the top of the recipe incorrectly reads
"GB200"; update that comment to reference "GB300" so it matches the
configuration name "gb300-1k1k-fp8-max" (keep the rest of the header text
unchanged) — locate the top-of-file comment and change "GB200" to "GB300".

In `@recipies/gb300-fp8/1k1k/stp/mid.yaml`:
- Around line 1-2: Update the top comment to match the actual config: replace
the incorrect header "GB200 FP8 Max Throughput Configuration" with a correct
description reflecting GB300 and mid-tier throughput (for example "GB300 FP8 Mid
Throughput Configuration"), ensuring it aligns with the config name field name:
"gb300-1k1k-fp8-mid".

In `@recipies/gb300-fp8/8k1k/stp/low-latency.yaml`:
- Around lines 118-123: The benchmark block sets isl: 8102, which is likely a typo; the other 8k1k configs use 8192. Verify the intended value. If it is a mistake, change isl to 8192 in the same benchmark section (leaving req_rate, osl, and concurrencies unchanged) and run any config validation or tests to confirm. If 8102 was intentional, add an inline comment explaining why it differs from the 8192 baseline.
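If the review's suspicion is right, the one-line fix would look like this (only the isl value comes from the review comment; the surrounding layout is assumed):

```yaml
benchmark:
  isl: 8192     # was 8102; aligned with the 8192 baseline used by the other 8k1k configs
  # req_rate, osl, and concurrencies stay unchanged
```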

In `@recipies/gb300-fp8/8k1k/stp/max.yaml`:
- Around line 1-3: The top-line comment is inconsistent with the configuration
name/gpu_type: update the header comment "# GB200 FP8 Max Throughput
Configuration" to reflect "GB300" so it matches the configuration name
"gb300-8k1k-fp8-max" (or alternatively rename the configuration/name/gpu_type to
GB200 if that was intended); ensure the header text and the symbol name
"gb300-8k1k-fp8-max" (and any gpu_type field) are consistent.

In `@recipies/gb300-fp8/8k1k/stp/mid.yaml`:
- Around line 1-3: The comment header is incorrect: replace "GB200 FP8 Max
Throughput Configuration" with a header that matches this recipe (GB300,
mid-throughput). Update the top-of-file comment to something like "GB300 FP8 Mid
Throughput Configuration" so it aligns with the name field "gb300-8k1k-fp8-mid"
and the file's intended throughput tier.
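All five header fixes above follow the same pattern; for example, in recipies/gb300-fp8/1k1k/stp/mid.yaml (the name field value is taken from the review comment, while the exact file layout is assumed):

```yaml
# GB300 FP8 Mid Throughput Configuration   <- was: "GB200 FP8 Max Throughput Configuration"
name: gb300-1k1k-fp8-mid
```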
🧹 Nitpick comments (1)
recipies/gb300-fp8/8k1k/stp/low-latency.yaml (1)

61-116: This configuration uses a different parameter naming style.

This file uses tensor-parallel-size, data-parallel-size, expert-parallel-size while other configurations in this PR use the shorter tp-size, dp-size, ep-size style. Both likely work, but consider standardizing for maintainability across configurations.
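Concretely, the two styles in question look like this (values are illustrative; sglang's CLI generally accepts the short forms as aliases of the long forms, though that should be verified for the deployed version):

```yaml
# Long-form flags, as in 8k1k/stp/low-latency.yaml:
tensor-parallel-size: 8
data-parallel-size: 8
expert-parallel-size: 8

# Short-form style used by the other configs in this PR:
tp-size: 8
dp-size: 8
ep-size: 8
```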

Comment threads:
  • recipies/gb300-fp8/1k1k/stp/max.yaml (outdated)
  • recipies/gb300-fp8/1k1k/stp/mid.yaml (outdated)
  • recipies/gb300-fp8/8k1k/stp/low-latency.yaml
  • recipies/gb300-fp8/8k1k/stp/max.yaml (outdated)
  • recipies/gb300-fp8/8k1k/stp/mid.yaml (outdated)
@gracehonv gracehonv merged commit e70d47b into main Jan 29, 2026
4 of 5 checks passed
ishandhanani pushed a commit that referenced this pull request Jan 29, 2026
csahithi pushed a commit to csahithi/srt-slurm that referenced this pull request Mar 25, 2026
ishandhanani added a commit that referenced this pull request Mar 25, 2026
* Merge pull request #118 from ishandhanani/grho/Jan29_a

configs for gb300-fp8-no-mtp

* Update SGL-GB200-FP4-1k8k configs to use dynamo-sglang container
and specify nginx container

* Add GB200-FP8-1k8k

* Update GB200 FP8 1k8k recipes

* typo

* only build for 9.0

* go

* go

* again

* try again

* go

* Update gb200 recipes (#130)

* Update GB200-FP8 configs

* Update GB200-FP4 configs

* Add nginx container to all GB200-FP8 configs

* Add nginx container to GB200-FP4 configs

* Cleanup configs

* Switch to use fast DG cache compile

* fix container

* clean up old

* Add 1k1k STP and MTP disagg H100 configs (#140)

* Add 1k1k STP and MTP disagg H100 configs

* Update H100 FP8 configs with verified 29 Pareto-optimal points

Replace previous configs with verified Pareto-optimal configurations:
- 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
- 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)

Standardize container to nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1

* Update H100 configs to tensorrtllm-runtime:0.8.1.post3

Update all 29 H100 FP8 config files to use the new container:
- nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3

* updates the recipe for Dynamo-SGLang B200 submissions

* adds modified B200-fp8 recipes

* updates the recipes

* prune the concurrency

* Add B200 MTP FP4 SGLANG recipes

* Update model path and container

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>

* modify b200 sgl fp4 non-mtp configs (#168)

* adds conc=128 point

* adds 1p2d config

* modify job name to support multiple gh runners (#182)

* Add resolved B200 FP8 8k1k recipe variants for CI compatibility

14 standalone recipe files resolved from the consolidated 8k1k.yaml
(main branch) for use with the sa-submission-q1-2026 srtctl which
does not support zip_override syntax.

STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
Made-with: Cursor

* Bump MTP 8k1k health check timeout from 60min to 120min

EAGLE speculative decoding + cuda-graph-max-bs=1024 requires ~50min
of CUDA graph capture alone on the decode worker. Combined with model
loading, DeepGEMM JIT warmup, and FlashInfer autotune, the total init
exceeds the 60min (360 attempts x 10s) health check window on cold nodes.

Increase max_attempts from 360 to 720 (7200s = 120min) on all 7 MTP
recipe variants to provide sufficient headroom.

Made-with: Cursor
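Assuming the 10-second polling interval implied by the commit message (360 attempts x 10 s = 60 min), the timeout arithmetic works out as follows. The field names here are a sketch, not the actual recipe keys:

```yaml
health_check:
  interval_s: 10
  max_attempts: 720    # 720 attempts x 10 s = 7200 s = 120 min (previously 360 x 10 s = 60 min)
```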

* Fix cuda-graph-max-bs on MTP maxtpt decode workers

With data-parallel-size=8 and dp-attention, the scheduler distributes
requests across 8 DP replicas. Each replica only sees
max-running-requests/dp concurrent sequences, so cuda-graph-max-bs
should be divided by dp accordingly.

Previous values caused CUDA graph capture of 99 batch sizes per DP
replica with EAGLE speculative decoding, taking 80+ minutes and
exceeding the health check timeout. Corrected values capture only
35 batch sizes, finishing in ~1 minute with no performance regression.

Validated: MTP 3P1D output throughput 15,124 tok/s matches reference
14,995 tok/s (+0.9%).

  maxtpt_0: 128 -> 16  (max-running=128, dp=8)
  maxtpt_1: 256 -> 32  (max-running=256, dp=8)
  maxtpt_2: 512 -> 64  (max-running=512, dp=8)
  maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)

Made-with: Cursor
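The rule this commit applies, dividing cuda-graph-max-bs by the DP degree when dp-attention shards requests across replicas, can be sketched for the maxtpt_3 variant (field names assumed from the commit message, not copied from the recipe):

```yaml
decode:
  data-parallel-size: 8
  enable-dp-attention: true
  max-running-requests: 1024
  cuda-graph-max-bs: 128     # = max-running-requests / data-parallel-size = 1024 / 8
```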

* fix rebase

---------

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>