Add 1k1k STP and MTP disagg H100 configs #140

Merged

ishandhanani merged 3 commits into sa-submission-q1-2026 on Feb 6, 2026
Conversation
Contributor
Replace previous configs with verified Pareto-optimal configurations:
- 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
- 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)

Standardize container to nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1

Update all 29 H100 FP8 config files to use the new container:
- nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
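The container bump above is a mechanical find-and-replace across recipe files. A minimal sketch of how it could be scripted, assuming the recipes are YAML files under a config directory (the directory layout and file extension are assumptions, not taken from the repo):

```python
# Sketch: bump the container tag across all recipe YAMLs.
# OLD_IMAGE/NEW_IMAGE come from the commit message; paths are hypothetical.
from pathlib import Path

OLD_IMAGE = "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
NEW_IMAGE = "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3"

def bump_container(text: str) -> str:
    """Replace the old container tag wherever it appears; leave other lines untouched."""
    return text.replace(OLD_IMAGE, NEW_IMAGE)

def bump_all(config_dir: str) -> int:
    """Rewrite every .yaml under config_dir in place; return how many files changed."""
    changed = 0
    for path in Path(config_dir).rglob("*.yaml"):
        text = path.read_text()
        new_text = bump_container(text)
        if new_text != text:
            path.write_text(new_text)
            changed += 1
    return changed
```

Replacing the full image string (registry, repo, and tag) rather than just the tag suffix avoids accidentally touching unrelated `0.8.1.post1` version strings elsewhere in the configs.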
csahithi pushed a commit to csahithi/srt-slurm that referenced this pull request on Mar 25, 2026
* Add 1k1k STP and MTP disagg H100 configs
* Update H100 FP8 configs with verified 29 Pareto-optimal points
  Replace previous configs with verified Pareto-optimal configurations:
  - 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
  - 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
  - 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
  - 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)
  Standardize container to nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1
* Update H100 configs to tensorrtllm-runtime:0.8.1.post3
  Update all 29 H100 FP8 config files to use the new container: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
ishandhanani added a commit that referenced this pull request on Mar 25, 2026
* Merge pull request #118 from ishandhanani/grho/Jan29_a: configs for gb300-fp8-no-mtp
* Update SGL-GB200-FP4-1k8k configs to use dynamo-sglang container and specify nginx container
* Add GB200-FP8-1k8k
* Update GB200 FP8 1k8k recipes
* typo
* only build for 9.0
* go
* go
* again
* try again
* go
* Update gb200 recipes (#130)
* Update GB200-FP8 configs
* Update GB200-FP4 configs
* Add nginx container to all GB200-FP8 configs
* Add nginx container to GB200-FP4 configs
* Cleanup configs
* Switch to use fast DG cache compile
* fix container
* clean up old
* Add 1k1k STP and MTP disagg H100 configs (#140)
  - Add 1k1k STP and MTP disagg H100 configs
  - Update H100 FP8 configs with verified 29 Pareto-optimal points. Replace previous configs with verified Pareto-optimal configurations:
    - 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
    - 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
    - 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
    - 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)
    Standardize container to nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1
  - Update H100 configs to tensorrtllm-runtime:0.8.1.post3. Update all 29 H100 FP8 config files to use the new container: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
* updates the recipe for Dynamo-SGLang B200 submissions
* adds modified B200-fp8 recipes
* updates the recipes
* prune the concurrency
* Add B200 MTP FP4 SGLANG recipes
* Update model path and container
  Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
* modify b200 sgl fp4 non-mtp configs (#168)
* adds conc=128 point
* adds 1p2d config
* modify job name to support multiple gh runners (#182)
* Add resolved B200 FP8 8k1k recipe variants for CI compatibility
  14 standalone recipe files resolved from the consolidated 8k1k.yaml (main branch) for use with the sa-submission-q1-2026 srtctl, which does not support zip_override syntax.
  STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
  MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
  Made-with: Cursor
* Bump MTP 8k1k health check timeout from 60min to 120min
  EAGLE speculative decoding + cuda-graph-max-bs=1024 requires ~50min of CUDA graph capture alone on the decode worker. Combined with model loading, DeepGEMM JIT warmup, and FlashInfer autotune, the total init exceeds the 60min (360 attempts x 10s) health check window on cold nodes. Increase max_attempts from 360 to 720 (7200s = 120min) on all 7 MTP recipe variants to provide sufficient headroom.
  Made-with: Cursor
* Fix cuda-graph-max-bs on MTP maxtpt decode workers
  With data-parallel-size=8 and dp-attention, the scheduler distributes requests across 8 DP replicas. Each replica only sees max-running-requests/dp concurrent sequences, so cuda-graph-max-bs should be divided by dp accordingly. Previous values caused CUDA graph capture of 99 batch sizes per DP replica with EAGLE speculative decoding, taking 80+ minutes and exceeding the health check timeout. Corrected values capture only 35 batch sizes, finishing in ~1 minute with no performance regression. Validated: MTP 3P1D output throughput 15,124 tok/s matches reference 14,995 tok/s (+0.9%).
  - maxtpt_0: 128 -> 16 (max-running=128, dp=8)
  - maxtpt_1: 256 -> 32 (max-running=256, dp=8)
  - maxtpt_2: 512 -> 64 (max-running=512, dp=8)
  - maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)
  Made-with: Cursor
* fix rebase

---------

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>
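The two fixes in the final commits are simple arithmetic, and both can be sanity-checked with a few lines. A minimal sketch, using the numbers from the commit messages (function names are illustrative, and the flag names are quoted from the commits, not verified against the SGLang CLI):

```python
def per_replica_graph_bs(max_running_requests: int, dp: int) -> int:
    """With data-parallel attention, each of the `dp` replicas sees only
    max-running-requests / dp concurrent sequences, so cuda-graph-max-bs
    can be divided by dp without losing batch-size coverage."""
    return max_running_requests // dp

def health_window_seconds(max_attempts: int, interval_s: int = 10) -> int:
    """Total health-check window = number of polling attempts x poll interval."""
    return max_attempts * interval_s

# The corrected maxtpt values from the commit message all follow the same rule:
# 128 -> 16, 256 -> 32, 512 -> 64, 1024 -> 128 (dp=8).
for max_running in (128, 256, 512, 1024):
    print(max_running, "->", per_replica_graph_bs(max_running, dp=8))

# Timeout bump: 360 attempts x 10 s = 3600 s (60 min) was too short for
# EAGLE CUDA graph capture; 720 x 10 s = 7200 s (120 min) adds headroom.
print(health_window_seconds(360), health_window_seconds(720))
```

Dividing `cuda-graph-max-bs` by the DP degree is what shrinks the number of captured graph sizes (99 down to 35 in the commit's measurement), since CUDA graphs are captured per batch size up to that cap on each replica.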
Add 18 verified Pareto-optimal H100 FP8 disaggregated TensorRT-LLM configurations for 1024/1024 ISL/OSL:
- 9 STP configs covering concurrencies: 6, 9, 30, 60, 231, 462, 924, 1845, 4916
- 9 MTP configs covering concurrencies: 6, 9, 30, 60, 117, 231, 615, 616, 1229