
configs for gb300-fp8-no-mtp#118

Merged
gracehonv merged 2 commits into main from grho/Jan29_a on Jan 29, 2026

Conversation

gracehonv (Collaborator) commented Jan 29, 2026

Summary by CodeRabbit

  • New Features

    • Added GB300 FP8 deployment profiles for low-latency, mid-throughput, and max-throughput modes covering 1k/8k sequence variants.
    • Included performance/benchmark presets and CUDA-graph tuning options for prefill and decode phases.
  • Chores

    • Added comprehensive environment and resource configuration sets to support FP8 inference on GB300 hardware.


coderabbitai bot (Contributor) commented Jan 29, 2026

📝 Walkthrough

Six new YAML deployment profiles were added for FP8 inference on GB300 hardware, covering low-latency, mid-, and max-throughput targets for two sequence-length variants (1k1k and 8k1k). Each file encodes resources, environment hooks, sglang_config, and benchmarking parameters.

Changes

Cohort / File(s): Summary

  • 1k1k FP8 Deployment Profiles (recipies/gb300-fp8/1k1k/stp/low-latency.yaml, recipies/gb300-fp8/1k1k/stp/max.yaml, recipies/gb300-fp8/1k1k/stp/mid.yaml):
    Three new configs for 1k1k FP8: resource allocation (prefill/decode nodes, workers, GPUs), backend env hooks, detailed sglang_config for prefill/decode (model, TP/DP/EP, kv-cache dtype, attention backend, disaggregation, DeepEP/MoE), plus CUDA-graph and benchmark settings.
  • 8k1k FP8 Deployment Profiles (recipies/gb300-fp8/8k1k/stp/low-latency.yaml, recipies/gb300-fp8/8k1k/stp/max.yaml, recipies/gb300-fp8/8k1k/stp/mid.yaml):
    Three new configs for 8k1k FP8: same structure as the 1k1k set but tuned for the longer 8k input sequence (resource counts, parallelism, memory/disaggregation, DeepEP/MoE options), plus CUDA-graph and benchmark parameters.
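For orientation, a profile of this shape might look roughly like the following. This is a hypothetical sketch only: every field name and value here is inferred from the walkthrough's description (resources, env hooks, sglang_config, benchmark), not copied from the actual files.

```yaml
# Hypothetical sketch of one profile; names and values are illustrative, not the real file
name: gb300-1k1k-fp8-mid
resources:
  prefill:
    nodes: 1
    gpus_per_node: 4
  decode:
    nodes: 2
    gpus_per_node: 4
environment:
  backend_hooks: []          # env hooks mentioned in the walkthrough
sglang_config:
  prefill:
    tp-size: 4
    dp-size: 1
    kv-cache-dtype: fp8_e4m3
    disaggregation-mode: prefill
  decode:
    tp-size: 4
    dp-size: 8
    disaggregation-mode: decode
    cuda-graph-max-bs: 128
benchmark:
  isl: 1024
  osl: 1024
  concurrencies: [64, 128, 256]
```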

Sequence Diagram(s)

(Skipped — changes are configuration files without new multi-component control flow requiring sequence visualization.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • ishandhanani

Poem

🐰 I hopped through YAML rows and leaves,
Tucked FP8 dreams beneath GB300 eaves.
Low-latency whispers, throughput sings—
Six new paths for model wings. 🥕✨

🚥 Pre-merge checks | ✅ 2 passed | ❓ 1 inconclusive

❓ Inconclusive checks (1)
  • Title check: The PR title "configs for gb300-fp8-no-mtp" is vague and generic; it does not clearly convey the specific changes made. Resolution: use a more descriptive title, such as "Add GB300 FP8 configuration files for low-latency, mid, and max throughput profiles".

✅ Passed checks (2)
  • Description check: Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring coverage: No functions found in the changed files, so the docstring coverage check was skipped.




gracehonv requested a review from trevor-m on January 29, 2026 at 20:06

coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@recipies/gb300-fp8/1k1k/stp/max.yaml`:
- Around line 1-3: The comment header at the top of the recipe incorrectly reads
"GB200"; update that comment to reference "GB300" so it matches the
configuration name "gb300-1k1k-fp8-max" (keep the rest of the header text
unchanged) — locate the top-of-file comment and change "GB200" to "GB300".

In `@recipies/gb300-fp8/1k1k/stp/mid.yaml`:
- Around line 1-2: Update the top comment to match the actual config: replace
the incorrect header "GB200 FP8 Max Throughput Configuration" with a correct
description reflecting GB300 and mid-tier throughput (for example "GB300 FP8 Mid
Throughput Configuration"), ensuring it aligns with the config name field name:
"gb300-1k1k-fp8-mid".

In `@recipies/gb300-fp8/8k1k/stp/low-latency.yaml`:
- Around lines 118-123: The benchmark block sets isl: 8102, which is likely a typo; the other 8k1k configs use 8192. Verify the intended value. If it is a mistake, change isl to 8192 in the same benchmark section (leaving req_rate, osl, and concurrencies unchanged) and run any config validation or tests to confirm. If 8102 was intentional, add an inline comment explaining why it differs from the 8192 baseline.
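If the review's suspicion is right, the one-line fix would look like this (only the isl value comes from the review comment; the surrounding layout is assumed):

```yaml
benchmark:
  isl: 8192     # was 8102; aligned with the 8192 baseline used by the other 8k1k configs
  # req_rate, osl, and concurrencies stay unchanged
```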

In `@recipies/gb300-fp8/8k1k/stp/max.yaml`:
- Around line 1-3: The top-line comment is inconsistent with the configuration
name/gpu_type: update the header comment "# GB200 FP8 Max Throughput
Configuration" to reflect "GB300" so it matches the configuration name
"gb300-8k1k-fp8-max" (or alternatively rename the configuration/name/gpu_type to
GB200 if that was intended); ensure the header text and the symbol name
"gb300-8k1k-fp8-max" (and any gpu_type field) are consistent.

In `@recipies/gb300-fp8/8k1k/stp/mid.yaml`:
- Around line 1-3: The comment header is incorrect: replace "GB200 FP8 Max
Throughput Configuration" with a header that matches this recipe (GB300,
mid-throughput). Update the top-of-file comment to something like "GB300 FP8 Mid
Throughput Configuration" so it aligns with the name field "gb300-8k1k-fp8-mid"
and the file's intended throughput tier.
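All five header fixes above follow the same pattern; for example, in recipies/gb300-fp8/1k1k/stp/mid.yaml (the name field value is taken from the review comment, while the exact file layout is assumed):

```yaml
# GB300 FP8 Mid Throughput Configuration   <- was: "GB200 FP8 Max Throughput Configuration"
name: gb300-1k1k-fp8-mid
```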
🧹 Nitpick comments (1)
recipies/gb300-fp8/8k1k/stp/low-latency.yaml (1)

61-116: This configuration uses a different parameter naming style.

This file uses tensor-parallel-size, data-parallel-size, expert-parallel-size while other configurations in this PR use the shorter tp-size, dp-size, ep-size style. Both likely work, but consider standardizing for maintainability across configurations.
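Concretely, the two styles in question look like this (values are illustrative; sglang's CLI generally accepts the short forms as aliases of the long forms, though that should be verified for the deployed version):

```yaml
# Long-form flags, as in 8k1k/stp/low-latency.yaml:
tensor-parallel-size: 8
data-parallel-size: 8
expert-parallel-size: 8

# Short-form style used by the other configs in this PR:
tp-size: 8
dp-size: 8
ep-size: 8
```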

Comment threads:
  • recipies/gb300-fp8/1k1k/stp/max.yaml (outdated)
  • recipies/gb300-fp8/1k1k/stp/mid.yaml (outdated)
  • recipies/gb300-fp8/8k1k/stp/low-latency.yaml
  • recipies/gb300-fp8/8k1k/stp/max.yaml (outdated)
  • recipies/gb300-fp8/8k1k/stp/mid.yaml (outdated)
@gracehonv gracehonv merged commit e70d47b into main Jan 29, 2026
4 of 5 checks passed
ishandhanani pushed a commit that referenced this pull request Jan 29, 2026
csahithi pushed a commit to csahithi/srt-slurm that referenced this pull request Mar 25, 2026
ishandhanani added a commit that referenced this pull request Mar 25, 2026
* Merge pull request #118 from ishandhanani/grho/Jan29_a

configs for gb300-fp8-no-mtp

* Update SGL-GB200-FP4-1k8k configs to use dynamo-sglang container
and specify nginx container

* Add GB200-FP8-1k8k

* Update GB200 FP8 1k8k recipes

* typo

* only build for 9.0

* go

* go

* again

* try again

* go

* Update gb200 recipes (#130)

* Update GB200-FP8 configs

* Update GB200-FP4 configs

* Add nginx container to all GB200-FP8 configs

* Add nginx container to GB200-FP4 configs

* Cleanup configs

* Switch to use fast DG cache compile

* fix container

* clean up old

* Add 1k1k STP and MTP disagg H100 configs (#140)

* Add 1k1k STP and MTP disagg H100 configs

* Update H100 FP8 configs with verified 29 Pareto-optimal points

Replace previous configs with verified Pareto-optimal configurations:
- 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
- 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)

Standardize container to nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1

* Update H100 configs to tensorrtllm-runtime:0.8.1.post3

Update all 29 H100 FP8 config files to use the new container:
- nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3

* updates the recipe for Dynamo-SGLang B200 submissions

* adds modified B200-fp8 recipes

* updates the recipes

* prune the concurrency

* Add B200 MTP FP4 SGLANG recipes

* Update model path and container

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>

* modify b200 sgl fp4 non-mtp configs (#168)

* adds conc=128 point

* adds 1p2d config

* modify job name to support multiple gh runners (#182)

* Add resolved B200 FP8 8k1k recipe variants for CI compatibility

14 standalone recipe files resolved from the consolidated 8k1k.yaml
(main branch) for use with the sa-submission-q1-2026 srtctl which
does not support zip_override syntax.

STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
Made-with: Cursor

* Bump MTP 8k1k health check timeout from 60min to 120min

EAGLE speculative decoding + cuda-graph-max-bs=1024 requires ~50min
of CUDA graph capture alone on the decode worker. Combined with model
loading, DeepGEMM JIT warmup, and FlashInfer autotune, the total init
exceeds the 60min (360 attempts x 10s) health check window on cold nodes.

Increase max_attempts from 360 to 720 (7200s = 120min) on all 7 MTP
recipe variants to provide sufficient headroom.

Made-with: Cursor
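Assuming the 10-second polling interval implied by the commit message (360 attempts x 10 s = 60 min), the timeout arithmetic works out as follows. The field names here are a sketch, not the actual recipe keys:

```yaml
health_check:
  interval_s: 10
  max_attempts: 720    # 720 attempts x 10 s = 7200 s = 120 min (previously 360 x 10 s = 60 min)
```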

* Fix cuda-graph-max-bs on MTP maxtpt decode workers

With data-parallel-size=8 and dp-attention, the scheduler distributes
requests across 8 DP replicas. Each replica only sees
max-running-requests/dp concurrent sequences, so cuda-graph-max-bs
should be divided by dp accordingly.

Previous values caused CUDA graph capture of 99 batch sizes per DP
replica with EAGLE speculative decoding, taking 80+ minutes and
exceeding the health check timeout. Corrected values capture only
35 batch sizes, finishing in ~1 minute with no performance regression.

Validated: MTP 3P1D output throughput 15,124 tok/s matches reference
14,995 tok/s (+0.9%).

  maxtpt_0: 128 -> 16  (max-running=128, dp=8)
  maxtpt_1: 256 -> 32  (max-running=256, dp=8)
  maxtpt_2: 512 -> 64  (max-running=512, dp=8)
  maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)

Made-with: Cursor
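The rule this commit applies, dividing cuda-graph-max-bs by the DP degree when dp-attention shards requests across replicas, can be sketched for the maxtpt_3 variant (field names assumed from the commit message, not copied from the recipe):

```yaml
decode:
  data-parallel-size: 8
  enable-dp-attention: true
  max-running-requests: 1024
  cuda-graph-max-bs: 128     # = max-running-requests / data-parallel-size = 1024 / 8
```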

* fix rebase

---------

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>