This repository was archived by the owner on Apr 20, 2026. It is now read-only.

modify b200 sgl fp4 non-mtp configs#168

Merged
ishandhanani merged 1 commit into ishandhanani:sa-submission-q1-2026 from csahithi:b200-fp4-sgl-non-mtp
Feb 10, 2026

Conversation

@csahithi
Contributor

No description provided.

@coderabbitai
Contributor

coderabbitai Bot commented Feb 10, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@ishandhanani ishandhanani merged commit f4e23a0 into ishandhanani:sa-submission-q1-2026 Feb 10, 2026
1 check passed
csahithi added a commit to csahithi/srt-slurm that referenced this pull request Mar 25, 2026
ishandhanani added a commit that referenced this pull request Mar 25, 2026
* Merge pull request #118 from ishandhanani/grho/Jan29_a

configs for gb300-fp8-no-mtp

* Update SGL-GB200-FP4-1k8k configs to use dynamo-sglang container
and specify nginx container

* Add GB200-FP8-1k8k

* Update GB200 FP8 1k8k recipes

* typo

* only build for 9.0

* go

* go

* again

* try again

* go

* Update gb200 recipes (#130)

* Update GB200-FP8 configs

* Update GB200-FP4 configs

* Add nginx container to all GB200-FP8 configs

* Add nginx container to GB200-FP4 configs

* Cleanup configs

* Switch to use fast DG cache compile

* fix container

* clean up old

* Add 1k1k STP and MTP disagg H100 configs (#140)

* Add 1k1k STP and MTP disagg H100 configs

* Update H100 FP8 configs with verified 29 Pareto-optimal points

Replace previous configs with verified Pareto-optimal configurations:
- 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
- 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)

Standardize container to nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1

* Update H100 configs to tensorrtllm-runtime:0.8.1.post3

Update all 29 H100 FP8 config files to use the new container:
- nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3

* updates the recipe for Dynamo-SGLang B200 submissions

* adds modified B200-fp8 recipes

* updates the recipes

* prune the concurrency

* Add B200 MTP FP4 SGLANG recipes

* Update model path and container

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>

* modify b200 sgl fp4 non-mtp configs (#168)

* adds conc=128 point

* adds 1p2d config

* modify job name to support multiple gh runners (#182)

* Add resolved B200 FP8 8k1k recipe variants for CI compatibility

14 standalone recipe files resolved from the consolidated 8k1k.yaml
(main branch) for use with the sa-submission-q1-2026 srtctl which
does not support zip_override syntax.

STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
Made-with: Cursor

* Bump MTP 8k1k health check timeout from 60min to 120min

EAGLE speculative decoding + cuda-graph-max-bs=1024 requires ~50min
of CUDA graph capture alone on the decode worker. Combined with model
loading, DeepGEMM JIT warmup, and FlashInfer autotune, the total init
exceeds the 60min (360 attempts x 10s) health check window on cold nodes.

Increase max_attempts from 360 to 720 (7200s = 120min) on all 7 MTP
recipe variants to provide sufficient headroom.

Made-with: Cursor
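The timeout arithmetic in the commit above can be sketched as follows. This is a minimal illustration only: the 10 s polling interval and attempt counts come from the commit message, while the function and parameter names are hypothetical and do not correspond to actual recipe keys.

```python
# Health-check window = number of attempts x polling interval.
# Assumption (from the commit message): a fixed 10 s interval between attempts.

def health_check_window_minutes(max_attempts: int, interval_s: int = 10) -> float:
    """Total time the health check waits before giving up, in minutes."""
    return max_attempts * interval_s / 60

# 360 attempts x 10 s = 60 min (old); 720 attempts x 10 s = 120 min (new)
print(health_check_window_minutes(360))  # 60.0
print(health_check_window_minutes(720))  # 120.0
```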

* Fix cuda-graph-max-bs on MTP maxtpt decode workers

With data-parallel-size=8 and dp-attention, the scheduler distributes
requests across 8 DP replicas. Each replica only sees
max-running-requests/dp concurrent sequences, so cuda-graph-max-bs
should be divided by dp accordingly.

Previous values caused CUDA graph capture of 99 batch sizes per DP
replica with EAGLE speculative decoding, taking 80+ minutes and
exceeding the health check timeout. Corrected values capture only
35 batch sizes, finishing in ~1 minute with no performance regression.

Validated: MTP 3P1D output throughput 15,124 tok/s matches reference
14,995 tok/s (+0.9%).

  maxtpt_0: 128 -> 16  (max-running=128, dp=8)
  maxtpt_1: 256 -> 32  (max-running=256, dp=8)
  maxtpt_2: 512 -> 64  (max-running=512, dp=8)
  maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)

Made-with: Cursor
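The per-replica correction described in the commit above can be sketched as a simple division. This assumes, per the commit message, that with dp-attention each of the `dp` replicas sees at most `max-running-requests / dp` concurrent sequences; the function name is hypothetical.

```python
# With data-parallel attention, requests are spread across `dp` replicas,
# so the CUDA graph capture ceiling per replica is max_running_requests / dp.

def cuda_graph_max_bs_per_replica(max_running_requests: int, dp: int) -> int:
    return max_running_requests // dp

# Reproduces the corrected values from the commit message (dp=8):
for max_running in (128, 256, 512, 1024):
    print(max_running, "->", cuda_graph_max_bs_per_replica(max_running, dp=8))
# 128 -> 16, 256 -> 32, 512 -> 64, 1024 -> 128
```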

* fix rebase

---------

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>