Add 1k1k STP and MTP disagg H100 configs #140

Merged
ishandhanani merged 3 commits into sa-submission-q1-2026 from trtllm-h100
Feb 6, 2026
Conversation

@nlevin-ui (Collaborator)

Add 18 verified Pareto-optimal H100 FP8 disaggregated TensorRT-LLM configurations for 1024/1024 ISL/OSL:
- 9 STP configs covering concurrencies: 6, 9, 30, 60, 231, 462, 924, 1845, 4916
- 9 MTP configs covering concurrencies: 6, 9, 30, 60, 117, 231, 615, 616, 1229


coderabbitai bot commented Feb 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Replace previous configs with verified Pareto-optimal configurations:
- 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
- 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)

Standardize container to nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1

Update all 29 H100 FP8 config files to use the new container:
- nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
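The container standardization described in the commit messages above is a bulk find-and-replace of the image string across the 29 recipe files. A minimal sketch, assuming the recipes are YAML files under a hypothetical `recipes/h100-fp8/` directory (the directory layout and key names are not stated in the PR):

```python
from pathlib import Path

# Old and new image tags, taken from the commit messages above.
OLD = "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
NEW = "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3"

def bump_container(text: str) -> str:
    """Swap the pinned container image; only the tag suffix changes."""
    return text.replace(OLD, NEW)

# Hypothetical layout; adjust the glob to the actual repo structure.
for path in Path("recipes/h100-fp8").glob("*.yaml"):
    path.write_text(bump_container(path.read_text()))
```

In practice a one-line `sed` achieves the same; the point is that only the tag (`post1` → `post3`) differs between the two commits.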
@ishandhanani ishandhanani merged commit 4183c95 into sa-submission-q1-2026 Feb 6, 2026
1 check passed
csahithi pushed a commit to csahithi/srt-slurm that referenced this pull request Mar 25, 2026
* Add 1k1k STP and MTP disagg H100 configs

* Update H100 FP8 configs with verified 29 Pareto-optimal points

Replace previous configs with verified Pareto-optimal configurations:
- 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
- 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)

Standardize container to nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1

* Update H100 configs to tensorrtllm-runtime:0.8.1.post3

Update all 29 H100 FP8 config files to use the new container:
- nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
ishandhanani added a commit that referenced this pull request Mar 25, 2026
* Merge pull request #118 from ishandhanani/grho/Jan29_a

configs for gb300-fp8-no-mtp

* Update SGL-GB200-FP4-1k8k configs to use dynamo-sglang container
and specify nginx container

* Add GB200-FP8-1k8k

* Update GB200 FP8 1k8k recipes

* typo

* only build for 9.0

* go

* go

* again

* try again

* go

* Update gb200 recipes (#130)

* Update GB200-FP8 configs

* Update GB200-FP4 configs

* Add nginx container to all GB200-FP8 configs

* Add nginx container to GB200-FP4 configs

* Cleanup configs

* Switch to use fast DG cache compile

* fix container

* clean up old

* Add 1k1k STP and MTP disagg H100 configs (#140)

* Add 1k1k STP and MTP disagg H100 configs

* Update H100 FP8 configs with verified 29 Pareto-optimal points

Replace previous configs with verified Pareto-optimal configurations:
- 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
- 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)

Standardize container to nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1

* Update H100 configs to tensorrtllm-runtime:0.8.1.post3

Update all 29 H100 FP8 config files to use the new container:
- nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3

* updates the recipe for Dynamo-SGLang B200 submissions

* adds modified B200-fp8 recipes

* updates the recipes

* prune the concurrency

* Add B200 MTP FP4 SGLANG recipes

* Update model path and container

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>

* modify b200 sgl fp4 non-mtp configs (#168)

* adds conc=128 point

* adds 1p2d config

* modify job name to support multiple gh runners (#182)

* Add resolved B200 FP8 8k1k recipe variants for CI compatibility

14 standalone recipe files resolved from the consolidated 8k1k.yaml
(main branch) for use with the sa-submission-q1-2026 srtctl, which
does not support zip_override syntax.

STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
Made-with: Cursor

* Bump MTP 8k1k health check timeout from 60min to 120min

EAGLE speculative decoding + cuda-graph-max-bs=1024 requires ~50min
of CUDA graph capture alone on the decode worker. Combined with model
loading, DeepGEMM JIT warmup, and FlashInfer autotune, the total init
exceeds the 60min (360 attempts x 10s) health check window on cold nodes.

Increase max_attempts from 360 to 720 (7200s = 120min) on all 7 MTP
recipe variants to provide sufficient headroom.

Made-with: Cursor
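The timeout arithmetic in the commit message above (360 attempts × 10 s = 60 min; 720 attempts = 120 min) can be sanity-checked with a short sketch; the 10 s polling interval and attempt counts come from the message, while the helper name is hypothetical:

```python
# Health-check window = polling attempts x polling interval.
POLL_INTERVAL_S = 10  # seconds between health-check attempts (from the commit)

def window_minutes(max_attempts: int, interval_s: int = POLL_INTERVAL_S) -> float:
    """Total health-check window in minutes for a given attempt budget."""
    return max_attempts * interval_s / 60

print(window_minutes(360))  # 60.0  -> too tight for ~50 min of graph capture
print(window_minutes(720))  # 120.0 -> bumped budget with headroom
```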

* Fix cuda-graph-max-bs on MTP maxtpt decode workers

With data-parallel-size=8 and dp-attention, the scheduler distributes
requests across 8 DP replicas. Each replica only sees
max-running-requests/dp concurrent sequences, so cuda-graph-max-bs
should be divided by dp accordingly.

Previous values caused CUDA graph capture of 99 batch sizes per DP
replica with EAGLE speculative decoding, taking 80+ minutes and
exceeding the health check timeout. Corrected values capture only
35 batch sizes, finishing in ~1 minute with no performance regression.

Validated: MTP 3P1D output throughput 15,124 tok/s matches reference
14,995 tok/s (+0.9%).

  maxtpt_0: 128 -> 16  (max-running=128, dp=8)
  maxtpt_1: 256 -> 32  (max-running=256, dp=8)
  maxtpt_2: 512 -> 64  (max-running=512, dp=8)
  maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)

Made-with: Cursor
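The corrected values in the table above all follow one rule: with dp-attention, each of the `dp` replicas only sees `max-running-requests / dp` concurrent sequences, so the CUDA graph batch-size cap scales down by the same factor. A minimal sketch of that derivation (the helper name is hypothetical):

```python
# Per-replica CUDA graph cap: each DP replica handles at most
# max-running-requests / dp sequences, so larger graph batch sizes
# would never be exercised.
def per_replica_graph_max_bs(max_running: int, dp: int) -> int:
    return max_running // dp

DP = 8  # data-parallel-size from the commit message
for max_running in (128, 256, 512, 1024):
    print(max_running, "->", per_replica_graph_max_bs(max_running, DP))
# 128 -> 16
# 256 -> 32
# 512 -> 64
# 1024 -> 128
```

These reproduce the four maxtpt corrections listed above; capturing fewer, smaller graphs is what cuts capture time from 80+ minutes to about one.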

* fix rebase

---------

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>