This repository was archived by the owner on Apr 20, 2026. It is now read-only.

modify b200 sgl fp4 non-mtp configs#168

Merged
ishandhanani merged 1 commit into ishandhanani:sa-submission-q1-2026 from csahithi:b200-fp4-sgl-non-mtp
Feb 10, 2026

Conversation

@csahithi
Contributor

No description provided.

@coderabbitai
Contributor

coderabbitai Bot commented Feb 10, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@ishandhanani ishandhanani merged commit f4e23a0 into ishandhanani:sa-submission-q1-2026 Feb 10, 2026
1 check passed
csahithi added a commit to csahithi/srt-slurm that referenced this pull request Mar 25, 2026
ishandhanani added a commit that referenced this pull request Mar 25, 2026
* Merge pull request #118 from ishandhanani/grho/Jan29_a

configs for gb300-fp8-no-mtp

* Update SGL-GB200-FP4-1k8k configs to use dynamo-sglang container
and specify nginx container

* Add GB200-FP8-1k8k

* Update GB200 FP8 1k8k recipes

* typo

* only build for 9.0

* go

* go

* again

* try again

* go

* Update gb200 recipes (#130)

* Update GB200-FP8 configs

* Update GB200-FP4 configs

* Add nginx container to all GB200-FP8 configs

* Add nginx container to GB200-FP4 configs

* Cleanup configs

* Switch to use fast DG cache compile

* fix container

* clean up old

* Add 1k1k STP and MTP disagg H100 configs (#140)

* Add 1k1k STP and MTP disagg H100 configs

* Update H100 FP8 configs with verified 29 Pareto-optimal points

Replace previous configs with verified Pareto-optimal configurations:
- 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
- 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)

Standardize container to nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1

* Update H100 configs to tensorrtllm-runtime:0.8.1.post3

Update all 29 H100 FP8 config files to use the new container:
- nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3

* updates the recipe for Dynamo-SGLang B200 submissions

* adds modified B200-fp8 recipes

* updates the recipes

* prune the concurrency

* Add B200 MTP FP4 SGLANG recipes

* Update model path and container

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>

* modify b200 sgl fp4 non-mtp configs (#168)

* adds conc=128 point

* adds 1p2d config

* modify job name to support multiple gh runners (#182)

* Add resolved B200 FP8 8k1k recipe variants for CI compatibility

14 standalone recipe files resolved from the consolidated 8k1k.yaml
(main branch) for use with the sa-submission-q1-2026 srtctl which
does not support zip_override syntax.

STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
Made-with: Cursor

* Bump MTP 8k1k health check timeout from 60min to 120min

EAGLE speculative decoding + cuda-graph-max-bs=1024 requires ~50min
of CUDA graph capture alone on the decode worker. Combined with model
loading, DeepGEMM JIT warmup, and FlashInfer autotune, the total init
exceeds the 60min (360 attempts x 10s) health check window on cold nodes.

Increase max_attempts from 360 to 720 (7200s = 120min) on all 7 MTP
recipe variants to provide sufficient headroom.

Made-with: Cursor
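The timeout arithmetic in the commit above can be sketched as follows. This is a minimal illustration only: the 10 s polling interval and attempt counts come from the commit message, while the function and parameter names are hypothetical and do not correspond to actual recipe keys.

```python
# Health-check window = number of attempts x polling interval.
# Assumption (from the commit message): a fixed 10 s interval between attempts.

def health_check_window_minutes(max_attempts: int, interval_s: int = 10) -> float:
    """Total time the health check waits before giving up, in minutes."""
    return max_attempts * interval_s / 60

# 360 attempts x 10 s = 60 min (old); 720 attempts x 10 s = 120 min (new)
print(health_check_window_minutes(360))  # 60.0
print(health_check_window_minutes(720))  # 120.0
```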

* Fix cuda-graph-max-bs on MTP maxtpt decode workers

With data-parallel-size=8 and dp-attention, the scheduler distributes
requests across 8 DP replicas. Each replica only sees
max-running-requests/dp concurrent sequences, so cuda-graph-max-bs
should be divided by dp accordingly.

Previous values caused CUDA graph capture of 99 batch sizes per DP
replica with EAGLE speculative decoding, taking 80+ minutes and
exceeding the health check timeout. Corrected values capture only
35 batch sizes, finishing in ~1 minute with no performance regression.

Validated: MTP 3P1D output throughput 15,124 tok/s matches reference
14,995 tok/s (+0.9%).

  maxtpt_0: 128 -> 16  (max-running=128, dp=8)
  maxtpt_1: 256 -> 32  (max-running=256, dp=8)
  maxtpt_2: 512 -> 64  (max-running=512, dp=8)
  maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)

Made-with: Cursor
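The per-replica correction described in the commit above can be sketched as a simple division. This assumes, per the commit message, that with dp-attention each of the `dp` replicas sees at most `max-running-requests / dp` concurrent sequences; the function name is hypothetical.

```python
# With data-parallel attention, requests are spread across `dp` replicas,
# so the CUDA graph capture ceiling per replica is max_running_requests / dp.

def cuda_graph_max_bs_per_replica(max_running_requests: int, dp: int) -> int:
    return max_running_requests // dp

# Reproduces the corrected values from the commit message (dp=8):
for max_running in (128, 256, 512, 1024):
    print(max_running, "->", cuda_graph_max_bs_per_replica(max_running, dp=8))
# 128 -> 16, 256 -> 32, 512 -> 64, 1024 -> 128
```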

* fix rebase

---------

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>