Improve prefill and max throughput config perf for B200 #214
weireweire merged 1 commit into ishandhanani:main
Walkthrough
This PR updates the FP8 inference recipe configuration file: it upgrades Dynamo to version 0.9.1, removes the legacy JIT DeepGEMM flags, introduces new CUDA optimization and quantization flags, expands parallelism and resource settings for both STP and MTP modes, and raises the concurrency levels from 128-256 to 288-2048.
Estimated code review effort: 3 (Moderate), ~20 minutes
All configs can reuse this prefill config, but the decode config for low latency can be fine-tuned further. Should give a large perf gain.
Summary: tune the prefill settings in recipes/b200-fp8/8k1k.yaml, adjust the max throughput decode settings and benchmark concurrencies, and add launch queue and per-token group quantization environment knobs.
Testing: not run.
Summary by CodeRabbit