This repository was archived by the owner on Apr 20, 2026. It is now read-only.

Improve prefill and max throughput config perf for B200 #214

Merged
weireweire merged 1 commit into ishandhanani:main from weireweire:refine-b200-config on Mar 12, 2026

@weireweire (Collaborator) commented Mar 12, 2026

All configs can reuse this prefill config, but the decode config for low latency can be fine-tuned further. Should give a large perf gain.

Summary: tune recipes/b200-fp8/8k1k.yaml prefill settings, adjust max throughput decode settings and benchmark concurrencies, and add launch queue plus per-token group quantization environment knobs.

Testing: not run.

Summary by CodeRabbit

  • Chores
    • Updated base configuration version to 0.9.1 with optimized resource allocation
    • Enhanced parallelism and load-balancing settings across deployment modes
    • Removed deprecated feature flags for improved configuration consistency
    • Added new performance optimization flags
    • Extended concurrency support to handle up to 2048 concurrent requests
    • Improved memory management and token limit configurations

@coderabbitai bot (Contributor) commented Mar 12, 2026

Walkthrough

This PR updates the FP8 inference recipe configuration file to upgrade Dynamo version to 0.9.1, remove legacy JIT DeepGEMM flags, introduce new CUDA optimization and quantization flags, expand parallelism and resource settings for both STP and MTP modes, and increase concurrency levels from 128-256 to 288-2048.

Changes

| Cohort / File(s) | Summary |
|---|---|
| FP8 Recipe Configuration: recipes/b200-fp8/8k1k.yaml | Updated global Dynamo version to 0.9.1, removed SGLANG_ENABLE_JIT_DEEPGEMM flag, added CUDA_SCALE_LAUNCH_QUEUES and SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2 flags. Expanded parallelism settings (data/expert-parallelism), enabled dp-attention/dp-lm-head, added round-robin load-balancing. Extended STP/MTP node arrays to length 4, adjusted memory fractions and token limits. Increased benchmark concurrency from 128-256 to 288-2048. Added MoE-dense TP configuration. |
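
The flag changes summarized above could be spot-checked with a small script. The following is a minimal sketch, not part of this PR: the flag names come from the walkthrough, while `check_env_flags`, the flat `env` mapping, and the placeholder values are hypothetical. The real recipe's YAML layout and the values it assigns are not shown in this conversation.

```python
# Flag names are taken from the walkthrough; nothing here reads the real recipe.
REMOVED_FLAGS = {"SGLANG_ENABLE_JIT_DEEPGEMM"}
ADDED_FLAGS = {"CUDA_SCALE_LAUNCH_QUEUES", "SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2"}


def check_env_flags(env):
    """Return a list of problems found in an environment-variable mapping.

    `env` is assumed to be a flat dict of variable name -> value, e.g. one
    environment section extracted from a recipe file (hypothetical layout).
    Only the presence of each flag is checked, never its value.
    """
    present = set(env)
    problems = []
    for name in sorted(REMOVED_FLAGS & present):
        problems.append(f"legacy flag still set: {name}")
    for name in sorted(ADDED_FLAGS - present):
        problems.append(f"expected flag missing: {name}")
    return problems


if __name__ == "__main__":
    # Placeholder values only; the actual values are not given in this PR page.
    example_env = {
        "CUDA_SCALE_LAUNCH_QUEUES": "<value>",
        "SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2": "<value>",
    }
    for problem in check_env_flags(example_env):
        print(problem)
```

Because the check looks only at flag presence, it would need extending before it could validate the tuned values themselves.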

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • ishandhanani/srt-slurm#78: Shares direct configuration key modifications for MTP/STP throughput tuning including max-running-requests, concurrency adjustments, and parallelism settings in FP8 recipe YAMLs.
  • ishandhanani/srt-slurm#148: Related through sglang_config modifications across prefill/decode paths with changes to load-balancing, CUDA-graph, memory fractions, and parallelism flags in FP8 recipe files.

Suggested reviewers

  • ishandhanani
  • kyleliang-nv

Poem

🐰 A config file hops with glee,
New flags and tunings set so free!
From 256 to 2048 we scale,
With CUDA queues and throughput's tale,
Dynamo dances, optimized to roam! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: improvements to prefill and max throughput configuration for the B200 model, which aligns with the core objective of tuning prefill settings and adjusting max-throughput decode settings. |


@weireweire changed the title from "Improve prefill and max throughput config for B200" to "Improve prefill and max throughput config perf for B200" on Mar 12, 2026
@weireweire merged commit 9fe6c37 into ishandhanani:main on Mar 12, 2026
4 of 5 checks passed