Improve prefill and max throughput config perf for B200 by weireweire · Pull Request #214 · ishandhanani/srt-slurm

weireweire · 2026-03-12T08:57:07Z

All configs can reuse this prefill config, but the decode config for low latency can be fine-tuned further. This should give a large perf gain.

Summary: tune recipes/b200-fp8/8k1k.yaml prefill settings, adjust max throughput decode settings and benchmark concurrencies, and add launch queue plus per-token group quantization environment knobs. Testing: not run.

Summary by CodeRabbit

Chores
- Updated base configuration version to 0.9.1 with optimized resource allocation
- Enhanced parallelism and load-balancing settings across deployment modes
- Removed deprecated feature flags for improved configuration consistency
- Added new performance optimization flags
- Extended concurrency support to handle up to 2048 concurrent requests
- Improved memory management and token limit configurations

coderabbitai · 2026-03-12T08:57:22Z

Walkthrough

This PR updates the FP8 inference recipe configuration file to upgrade Dynamo version to 0.9.1, remove legacy JIT DeepGEMM flags, introduce new CUDA optimization and quantization flags, expand parallelism and resource settings for both STP and MTP modes, and increase concurrency levels from 128-256 to 288-2048.

Changes

| Cohort / File(s) | Summary |
|---|---|
| FP8 Recipe Configuration `recipes/b200-fp8/8k1k.yaml` | Updated global Dynamo version to 0.9.1, removed SGLANG_ENABLE_JIT_DEEPGEMM flag, added CUDA_SCALE_LAUNCH_QUEUES and SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2 flags. Expanded parallelism settings (data/expert-parallelism), enabled dp-attention/dp-lm-head, added round-robin load-balancing. Extended STP/MTP node arrays to length 4, adjusted memory fractions and token limits. Increased benchmark concurrency from 128-256 to 288-2048. Added MoE-dense TP configuration. |
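As a rough illustration of the flag changes summarized above, the sketch below checks an environment mapping for the one removed flag and the two added flags. The flat `env` dictionary, the helper name `check_env_flags`, and the placeholder values are assumptions made for this example; they are not taken from `recipes/b200-fp8/8k1k.yaml`, whose actual layout and flag values are not shown in this PR conversation.

```python
# Hypothetical consistency check for the flag changes described in the walkthrough.
# The flat {name: value} mapping is an assumed shape, not the recipe's real schema.
REMOVED_FLAGS = {"SGLANG_ENABLE_JIT_DEEPGEMM"}
ADDED_FLAGS = {"CUDA_SCALE_LAUNCH_QUEUES", "SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2"}


def check_env_flags(env):
    """Return a list of problems; an empty list means the flags match the PR summary."""
    present = set(env)
    problems = []
    # Flags the PR removed should no longer be set.
    for name in sorted(REMOVED_FLAGS & present):
        problems.append(f"removed flag still set: {name}")
    # Flags the PR added should be present (their values are not checked here).
    for name in sorted(ADDED_FLAGS - present):
        problems.append(f"expected flag missing: {name}")
    return problems


# Placeholder values only; the real values are not stated in this conversation.
old_env = {"SGLANG_ENABLE_JIT_DEEPGEMM": "placeholder"}
new_env = {
    "CUDA_SCALE_LAUNCH_QUEUES": "placeholder",
    "SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2": "placeholder",
}

print(check_env_flags(old_env))
print(check_env_flags(new_env))
```

A check like this only confirms which flags are present or absent; it says nothing about whether the chosen values are good for performance, which still needs a benchmark run.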

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

ishandhanani/srt-slurm#78: Shares direct configuration key modifications for MTP/STP throughput tuning including max-running-requests, concurrency adjustments, and parallelism settings in FP8 recipe YAMLs.
ishandhanani/srt-slurm#148: Related through sglang_config modifications across prefill/decode paths with changes to load-balancing, CUDA-graph, memory fractions, and parallelism flags in FP8 recipe files.

Suggested reviewers

ishandhanani
kyleliang-nv

Poem

🐰 A config file hops with glee,
New flags and tunings set so free!
From 256 to 2048 we scale,
With CUDA queues and throughput's tale,
Dynamo dances, optimized to roam! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: improvements to prefill and max throughput configuration for the B200 model, which aligns with the core objective of tuning prefill settings and adjusting max-throughput decode settings. |


improve prefill config perf and max throughput config for B200

a76b6bf

weireweire changed the title ~~Improve prefill and max throughput config for B200~~ Improve prefill and max throughput config perf for B200 Mar 12, 2026

weireweire merged commit 9fe6c37 into ishandhanani:main Mar 12, 2026
4 of 5 checks passed

coderabbitai bot mentioned this pull request Mar 12, 2026

Update b200 fp8 8k1k stp low latency #215

Merged

coderabbitai bot mentioned this pull request Mar 23, 2026

Improve max-throughput and lowest latency point for DSR1 B200 8k1k. #227

Merged

coderabbitai bot mentioned this pull request Mar 30, 2026

recipes: consolidate gb200 fp8 8k1k overrides #230

Open

weireweire merged 1 commit into ishandhanani:main from weireweire:refine-b200-config
