This repository was archived by the owner on Apr 20, 2026. It is now read-only.
modify b200 sgl fp4 non-mtp configs #168
Merged
ishandhanani merged 1 commit into ishandhanani:sa-submission-q1-2026 on Feb 10, 2026
Conversation
CodeRabbit: Review skipped. Auto reviews are disabled on base/target branches other than the default branch.
Commit f4e23a0 merged into ishandhanani:sa-submission-q1-2026. 1 check passed.
csahithi added a commit to csahithi/srt-slurm that referenced this pull request on Mar 25, 2026.
ishandhanani added a commit that referenced this pull request on Mar 25, 2026.
Commit history (squashed):

* Merge pull request #118 from ishandhanani/grho/Jan29_a: configs for gb300-fp8-no-mtp
* Update SGL-GB200-FP4-1k8k configs to use the dynamo-sglang container and specify the nginx container
* Add GB200-FP8-1k8k
* Update GB200 FP8 1k8k recipes
* Fix typo
* Only build for 9.0
* CI retry commits ("go", "go", "again", "try again", "go")
* Update gb200 recipes (#130):
  * Update GB200-FP8 configs
  * Update GB200-FP4 configs
  * Add nginx container to all GB200-FP8 configs
  * Add nginx container to GB200-FP4 configs
  * Clean up configs
  * Switch to the fast DG cache compile
  * Fix container
  * Clean up old configs
* Add 1k1k STP and MTP disagg H100 configs (#140)
* Update H100 FP8 configs with 29 verified Pareto-optimal points. Replace previous configs with verified Pareto-optimal configurations:
  * 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
  * 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
  * 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
  * 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)
  Standardize the container to nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1
* Update H100 configs to tensorrtllm-runtime:0.8.1.post3: update all 29 H100 FP8 config files to use the new container, nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
* Update the recipe for Dynamo-SGLang B200 submissions
* Add modified B200-fp8 recipes
* Update the recipes
* Prune the concurrency points
* Add B200 MTP FP4 SGLANG recipes
* Update model path and container (Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>)
* Modify b200 sgl fp4 non-mtp configs (#168)
* Add conc=128 point
* Add 1p2d config
* Modify job name to support multiple GH runners (#182)
* Add resolved B200 FP8 8k1k recipe variants for CI compatibility: 14 standalone recipe files resolved from the consolidated 8k1k.yaml (main branch) for use with the sa-submission-q1-2026 srtctl, which does not support zip_override syntax.
  * STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
  * MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
  (Made-with: Cursor)
* Bump the MTP 8k1k health-check timeout from 60 min to 120 min. EAGLE speculative decoding with cuda-graph-max-bs=1024 requires ~50 min of CUDA graph capture alone on the decode worker. Combined with model loading, DeepGEMM JIT warmup, and FlashInfer autotune, total init exceeds the 60 min (360 attempts x 10 s) health-check window on cold nodes. Increase max_attempts from 360 to 720 (7200 s = 120 min) on all 7 MTP recipe variants to provide sufficient headroom. (Made-with: Cursor)
* Fix cuda-graph-max-bs on MTP maxtpt decode workers. With data-parallel-size=8 and dp-attention, the scheduler distributes requests across 8 DP replicas. Each replica sees only max-running-requests/dp concurrent sequences, so cuda-graph-max-bs should be divided by dp accordingly. The previous values caused CUDA graph capture of 99 batch sizes per DP replica with EAGLE speculative decoding, taking 80+ minutes and exceeding the health-check timeout. The corrected values capture only 35 batch sizes, finishing in ~1 minute with no performance regression. Validated: MTP 3P1D output throughput of 15,124 tok/s matches the reference 14,995 tok/s (+0.9%).
  * maxtpt_0: 128 -> 16 (max-running=128, dp=8)
  * maxtpt_1: 256 -> 32 (max-running=256, dp=8)
  * maxtpt_2: 512 -> 64 (max-running=512, dp=8)
  * maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)
  (Made-with: Cursor)
* Fix rebase

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>
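The health-check and cuda-graph fixes in the commit log above both come down to simple arithmetic. A minimal sketch of that arithmetic, assuming the 10 s polling interval stated in the commit message (the function names here are illustrative, not from the repo):

```python
def health_check_window_s(max_attempts: int, interval_s: int = 10) -> int:
    """Total health-check window in seconds: attempts x polling interval."""
    return max_attempts * interval_s


def per_replica_cuda_graph_max_bs(max_running_requests: int, dp: int) -> int:
    """With dp-attention, each data-parallel replica only sees
    max_running_requests / dp concurrent sequences, so cuda-graph-max-bs
    is sized per replica rather than for the whole worker."""
    return max_running_requests // dp


# Timeout bump: 360 attempts x 10 s = 3600 s (60 min) -> 720 x 10 s = 7200 s (120 min).
assert health_check_window_s(360) == 3600
assert health_check_window_s(720) == 7200

# Corrected cuda-graph-max-bs values from the commit message (dp=8):
for max_running, expected in [(128, 16), (256, 32), (512, 64), (1024, 128)]:
    assert per_replica_cuda_graph_max_bs(max_running, dp=8) == expected
```

This reproduces the exact maxtpt_0..maxtpt_3 values listed in the commit message; the ~35 vs 99 captured batch sizes and the ~50 min capture time are measurements reported there, not derivable from this sketch.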
No description provided.