Add MTP (Multi-Token Prediction) recipe variants for H200 configurations #120

ishandhanani merged 8 commits into ishandhanani:main from
Conversation
Add speculative decoding (MTP) versions for all H200 8k1k recipes:

- bs128-1p1d-dep-mtp.yaml
- bs128-agg-tp-mtp.yaml
- bs16-1p3d-mtp.yaml
- bs4-1p7d-mtp.yaml
- bs64-2p3d-mtp.yaml
- bs8-1p6d-mtp.yaml

MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.
Add speculative decoding (MTP) versions for all H200 1k1k recipes:

- bs128-agg-tp-mtp.yaml
- bs256-1p6d-dep-mtp.yaml
- bs256-1p6d-tp-mtp.yaml
- low-latency-1p9d-mtp.yaml

MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.
Adjust MTP configurations to use smaller batch sizes compared to non-MTP:

- 1k1k: MTP batch size = STP / 4 (512→128, 256→64)
- 8k1k: MTP batch size = STP / 8 (256→32, 128→16, 32→4, 16→2, 8→2)

Also reduce mem-fraction-static from 0.82-0.88 to 0.75-0.80 to leave more memory for MTP draft workers and avoid CUDA OOM errors. This matches the pattern used in trtllm/h200 MTP configs, where batch sizes are significantly reduced to accommodate speculative decoding.
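Taken together, the commit messages above describe an overlay on top of each STP baseline. A minimal sketch of that overlay, using only key names and values quoted in this PR (the env var's value and the `env:` nesting are assumptions; the PR names the variable but not its value or placement):

```yaml
# Illustrative MTP overlay on an 8k1k STP recipe (values from this PR's description)
env:
  SGLANG_ENABLE_SPEC_V2: "1"   # assumed value; the PR only names the variable

# Batch size reduced to STP / 8 for 8k1k (e.g. 256 -> 32)
max-running-requests: 32
cuda-graph-max-bs: 32
mem-fraction-static: 0.75      # down from 0.82-0.88 to leave room for draft workers

# Speculative decoding (MTP) params
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3
```

Whether `env:` is the correct nesting depends on the recipe schema; the point of the sketch is which knobs move and in which direction.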
- Extend benchmark concurrencies to cover a wider range for better Pareto curves
- Increase mem-fraction-static, max-running-requests, and cuda-graph-max-bs for MTP configs to improve throughput
Caution: Review failed. The pull request is closed.

📝 Walkthrough

Adds 10 new SGLang YAML deployment recipes for H200 GPUs, defining aggregated and disaggregated prefill/decode topologies with FP8 settings, disaggregation parameters, memory/token budgets, CUDA-graph caps, and MTP speculative decoding options across various batch and node configurations.
🚥 Pre-merge checks: 3 passed
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipes/h200/1k1k/bs256-1p6d-tp-mtp.yaml`:
- Around line 101-107: The benchmark concurrencies ("concurrencies" field set to
"512x1024x2048") exceed the runtime decode caps; either increase the decode
limits by raising max-running-requests and cuda-graph-max-bs to at least 2048
(so they match the highest concurrency) or revert the "concurrencies" value back
to a supported set like "128x256x512"; update the fields named
max-running-requests and cuda-graph-max-bs (or the decode configuration block)
to the new numeric limits if you choose to raise limits, or change the
benchmark.concurrencies string to the lower values if you choose to align with
current caps.
In `@recipes/h200/8k1k/bs8-1p6d-mtp.yaml`:
- Around line 101-106: The benchmark concurrency list in benchmark.concurrencies
includes "32" which exceeds the decode server caps set by max-running-requests
and cuda-graph-max-bs (both currently 16); either remove "32" from the
benchmark.concurrencies string (e.g., change "2x4x8x16x32" to "2x4x8x16") or
raise both decode config keys max-running-requests and cuda-graph-max-bs to 32
so the decode server can actually run 32 concurrent requests; update the
corresponding values for the decode service (max-running-requests and
cuda-graph-max-bs) or adjust benchmark.concurrencies accordingly to keep them
consistent.
🧹 Nitpick comments (6)
recipes/h200/8k1k/bs64-2p3d-mtp.yaml (2)
36-40: Minor: Trailing whitespace on line 39. There's a trailing space after `ep-size: 1`. Consider removing for consistency.

🧹 Proposed fix

```diff
  # Parallelism
  tp-size: 8
  dp-size: 1
- ep-size: 1 
+ ep-size: 1
```
89-90: Minor: Trailing whitespace on line 90. There's a trailing space after `max-total-tokens: 128000`.

🧹 Proposed fix

```diff
  context-length: 72000
- max-total-tokens: 128000 
+ max-total-tokens: 128000
```

recipes/h200/8k1k/bs16-1p3d.yaml (1)
96-96: Align benchmark concurrencies with server caps if you want true 64‑way runs.
`max-running-requests` and `cuda-graph-max-bs` are 32, so a 64-way concurrency run will queue/cap rather than execute 64 active requests. If the goal is true 64‑way runs, consider raising those caps; otherwise, drop 64 from the benchmark list to keep the curve interpretable.

🔧 Optional tweak to keep concurrencies within current caps

```diff
- concurrencies: "4x8x16x32x64"
+ concurrencies: "4x8x16x32"
```

recipes/h200/8k1k/bs16-1p3d-mtp.yaml (2)
65-93: Verify MTP memory headroom with `mem-fraction-static: 0.82`. MTP draft workers can reduce usable memory headroom. If you see OOMs or paging during runs, consider lowering the decode `mem-fraction-static` to reserve more space.

🔧 Example adjustment if headroom proves tight

```diff
- mem-fraction-static: 0.82
+ mem-fraction-static: 0.78
```
100-105: Benchmark concurrencies include 64; ensure server caps allow 64 active requests. With `max-running-requests`/`cuda-graph-max-bs` at 32, 64 concurrency will queue/cap rather than run 64 active requests. Either raise caps or keep concurrencies ≤ 32 to avoid skewed Pareto points.

🔧 Optional tweak to keep concurrencies within current caps

```diff
- concurrencies: "4x8x16x32x64"
+ concurrencies: "4x8x16x32"
```

recipes/h200/8k1k/bs128-agg-tp-mtp.yaml (1)
1-1: Consider: Filename indicates bs128 but `max-running-requests` is 32. The filename `bs128-agg-tp-mtp.yaml` suggests batch size 128, but `max-running-requests` and `cuda-graph-max-bs` are both set to 32 (consistent with "MTP = STP / 4" per PR objectives). This naming convention ties MTP variants to their STP counterparts, which aids comparison. If this naming is intentional for traceability to the STP baseline, no change needed. Otherwise, consider renaming to reflect actual batch size for clarity (e.g., `bs32-agg-tp-mtp.yaml`).
```yaml
benchmark:
  type: "sa-bench"
  isl: 1024
  osl: 1024
  # concurrencies: "128x256x512"
  concurrencies: "512x1024x2048"
  req_rate: "inf"
```
Benchmark concurrencies exceed max-running-requests (128).
Line 106 targets 512/1024/2048 while the decode configuration caps at 128 (line 92–93), causing the server to queue requests and preventing the benchmark from fully testing the intended concurrency levels. Raise max-running-requests/cuda-graph-max-bs to support higher concurrencies, or revert to the 128/256/512 set to match current limits.
🔧 Option: align concurrencies with current runtime caps

```diff
- # concurrencies: "128x256x512"
- concurrencies: "512x1024x2048"
+ concurrencies: "128x256x512"
```

🤖 Prompt for AI Agents
In `@recipes/h200/1k1k/bs256-1p6d-tp-mtp.yaml` around lines 101 - 107, The
benchmark concurrencies ("concurrencies" field set to "512x1024x2048") exceed
the runtime decode caps; either increase the decode limits by raising
max-running-requests and cuda-graph-max-bs to at least 2048 (so they match the
highest concurrency) or revert the "concurrencies" value back to a supported set
like "128x256x512"; update the fields named max-running-requests and
cuda-graph-max-bs (or the decode configuration block) to the new numeric limits
if you choose to raise limits, or change the benchmark.concurrencies string to
the lower values if you choose to align with current caps.
```yaml
benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "2x4x8x16x32"
  req_rate: "inf"
```
Benchmark concurrency 32 exceeds decode server limits.
Line 105 sets concurrency to test up to 32, but line 92–93 configure decode with max-running-requests: 16 and cuda-graph-max-bs: 16. The benchmark will be throttled to 16 concurrent requests when testing concurrency=32. Either raise these caps to 32 or remove 32 from the concurrency list.
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs8-1p6d-mtp.yaml` around lines 101 - 106, The benchmark
concurrency list in benchmark.concurrencies includes "32" which exceeds the
decode server caps set by max-running-requests and cuda-graph-max-bs (both
currently 16); either remove "32" from the benchmark.concurrencies string (e.g.,
change "2x4x8x16x32" to "2x4x8x16") or raise both decode config keys
max-running-requests and cuda-graph-max-bs to 32 so the decode server can
actually run 32 concurrent requests; update the corresponding values for the
decode service (max-running-requests and cuda-graph-max-bs) or adjust
benchmark.concurrencies accordingly to keep them consistent.
@ishandhanani could you review and merge? thanks
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@recipes/h200/8k1k/bs128-1p1d-dep-mtp.yaml`:
- Around line 103-106: The EAGLE speculative config is missing the required
speculative-draft-model-path parameter; add a new key named
speculative-draft-model-path under the same block where speculative-algorithm:
"EAGLE" is set and assign it a valid EAGLE draft model identifier (e.g.,
"jamesliu/sglang-EAGLE3-Llama-...") so that speculative decoding parameters
(speculative-num-steps, speculative-eagle-topk, speculative-num-draft-tokens)
can function.
In `@recipes/h200/8k1k/bs128-agg-tp-mtp.yaml`:
- Around line 27-31: The config sets trust-remote-code: true which allows
executing arbitrary remote model code; confirm whether this deployment trusts
the remote repo or change to trust-remote-code: false and instead pin the model
to a specific commit/sha (or add a checksum) to avoid remote code execution;
update the recipe entry for served-model-name / model-path to either set
trust-remote-code: false or add a clear comment and a pinned ref (e.g., commit
tag or digest) and include a short justification in the PR so reviewers know
this was an intentional security decision.
- Around line 44-65: The max-running-requests value (max-running-requests: 32)
limits throughput below the highest benchmark points in benchmark.concurrencies
("1x4x16x32x64x128x256"); update max-running-requests to at least the highest
concurrency (e.g., 256) so it doesn't clip results, or alternatively trim
benchmark.concurrencies to remove values above 32; locate the keys
max-running-requests and benchmark.concurrencies in the YAML (symbols:
max-running-requests and benchmark.concurrencies) and make the corresponding
change so the cap and concurrency list are aligned.
In `@recipes/h200/8k1k/bs4-1p7d-mtp.yaml`:
- Line 45: Remove the trailing whitespace after the value for the ep-size
setting so the line reads "ep-size: 1" with no extra space; locate the line
containing the key "ep-size" in the recipe (ep-size: 1) and trim the trailing
space to match file formatting consistency.
🧹 Nitpick comments (1)
recipes/h200/1k1k/low-latency-1p9d-mtp.yaml (1)
43-45: Minor formatting: trailing whitespace and extra blank line. Line 45 has trailing whitespace after `ep-size: 1`, and there's an extra blank line at 57. These don't affect functionality but could be cleaned up for consistency.

Suggested diff

```diff
  # Parallelism
  tp-size: 8
  dp-size: 1
- ep-size: 1 
+ ep-size: 1
  # KV cache and attention
  attention-backend: "flashinfer"
  # Radix cache disabled
  disable-radix-cache: true
  # Other flags
  # stream-interval: 50
  max-running-requests: 256
-
```

Also applies to: 56-57
```yaml
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3
```
🌐 Web query: SGLang v0.5.8 EAGLE speculative decoding support

💡 Result:
Yes—SGLang v0.5.8 (PyPI upload: 2026‑01‑23) includes EAGLE-based speculative decoding support (both EAGLE‑2 and EAGLE‑3) and documents how to run it via server flags. [1][2]
How to enable (server CLI)

- EAGLE‑2: set `--speculative-algorithm EAGLE`
- EAGLE‑3: set `--speculative-algorithm EAGLE3`
- Provide a draft model via the required `--speculative-draft-model-path ...`

Key tuning flags (apply to EAGLE‑2/‑3): `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens`. [2]
Example (EAGLE‑3):

```shell
python3 -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 32
```

[2]
EAGLE‑2 “FR-Spec” token-map optimization
For EAGLE‑2, SGLang also documents an additional optimization enabled with `--speculative-token-map ...` [2]
Notes / compatibility
SGLang’s docs note speculative decoding is compatible with radix cache and chunked prefill. [3]
Sources: PyPI release page for v0.5.8 [1]; SGLang speculative decoding docs [2][3].
Add required speculative-draft-model-path parameter for EAGLE configuration.
SGLang v0.5.8 supports EAGLE speculative decoding and the specified parameters (speculative-num-steps, speculative-eagle-topk, speculative-num-draft-tokens). However, the configuration is missing the required speculative-draft-model-path parameter, which must point to a valid EAGLE draft model (e.g., jamesliu/sglang-EAGLE3-Llama-...). Without it, the EAGLE speculative decoding will not function.
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs128-1p1d-dep-mtp.yaml` around lines 103 - 106, The EAGLE
speculative config is missing the required speculative-draft-model-path
parameter; add a new key named speculative-draft-model-path under the same block
where speculative-algorithm: "EAGLE" is set and assign it a valid EAGLE draft
model identifier (e.g., "jamesliu/sglang-EAGLE3-Llama-...") so that speculative
decoding parameters (speculative-num-steps, speculative-eagle-topk,
speculative-num-draft-tokens) can function.
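A sketch of the corrected block; the draft model path value is a placeholder, not a verified model id, since the right value depends on which EAGLE draft weights this deployment uses:

```yaml
speculative-algorithm: "EAGLE"
speculative-draft-model-path: "<org>/<eagle-draft-model>"  # placeholder; must point to real draft weights
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3
```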
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
model-path: "/model/"
skip-tokenizer-init: true
trust-remote-code: true
```
Confirm trust-remote-code: true is acceptable for this deployment.
Line 30 enables execution of remote model code, which is a meaningful security posture risk if the model repo isn’t tightly controlled or pinned.
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs128-agg-tp-mtp.yaml` around lines 27 - 31, The config
sets trust-remote-code: true which allows executing arbitrary remote model code;
confirm whether this deployment trusts the remote repo or change to
trust-remote-code: false and instead pin the model to a specific commit/sha (or
add a checksum) to avoid remote code execution; update the recipe entry for
served-model-name / model-path to either set trust-remote-code: false or add a
clear comment and a pinned ref (e.g., commit tag or digest) and include a short
justification in the PR so reviewers know this was an intentional security
decision.
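If the team decides to keep the flag, the comment-and-pin approach the prompt describes could look like this (the pin comment and justification text are illustrative; the recipe schema may not have a dedicated revision key):

```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
# Local snapshot; record the upstream commit it was downloaded from, e.g.
# huggingface.co/deepseek-ai/DeepSeek-R1 @ <commit-sha>
model-path: "/model/"
skip-tokenizer-init: true
# Intentional: the model ships custom modeling code; the snapshot above is vetted and pinned.
trust-remote-code: true
```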
```yaml
max-running-requests: 32  # sum of all dp

# Memory and token limits
mem-fraction-static: 0.75
max-prefill-tokens: 32768
chunked-prefill-size: 32768

# CUDA graphs
cuda-graph-max-bs: 32

# MTP settings
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "1x4x16x32x64x128x256"
  req_rate: "inf"
```
Align max-running-requests with benchmark concurrencies.
Line 44 caps running requests at 32, but Line 64 includes concurrencies up to 256. That will cap/flatten higher points and distort the Pareto curve. Either raise the limit or trim the concurrency list to ≤32.
🔧 Suggested adjustment (trim concurrencies)

```diff
- concurrencies: "1x4x16x32x64x128x256"
+ concurrencies: "1x4x16x32"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
max-running-requests: 32  # sum of all dp

# Memory and token limits
mem-fraction-static: 0.75
max-prefill-tokens: 32768
chunked-prefill-size: 32768

# CUDA graphs
cuda-graph-max-bs: 32

# MTP settings
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "1x4x16x32"
  req_rate: "inf"
```
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs128-agg-tp-mtp.yaml` around lines 44 - 65, The
max-running-requests value (max-running-requests: 32) limits throughput below
the highest benchmark points in benchmark.concurrencies
("1x4x16x32x64x128x256"); update max-running-requests to at least the highest
concurrency (e.g., 256) so it doesn't clip results, or alternatively trim
benchmark.concurrencies to remove values above 32; locate the keys
max-running-requests and benchmark.concurrencies in the YAML (symbols:
max-running-requests and benchmark.concurrencies) and make the corresponding
change so the cap and concurrency list are aligned.
```yaml
# Parallelism
tp-size: 8
dp-size: 1
ep-size: 1 
```
Minor: Trailing whitespace.
Line 45 has a trailing space after the value. While this shouldn't affect parsing, it's inconsistent with the rest of the file.
Suggested fix

```diff
- ep-size: 1 
+ ep-size: 1
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
ep-size: 1
```
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs4-1p7d-mtp.yaml` at line 45, Remove the trailing
whitespace after the value for the ep-size setting so the line reads "ep-size:
1" with no extra space; locate the line containing the key "ep-size" in the
recipe (ep-size: 1) and trim the trailing space to match file formatting
consistency.
…ons (#120)

* Add MTP recipe variants for H200 8k1k configurations

  Add speculative decoding (MTP) versions for all H200 8k1k recipes: bs128-1p1d-dep-mtp.yaml, bs128-agg-tp-mtp.yaml, bs16-1p3d-mtp.yaml, bs4-1p7d-mtp.yaml, bs64-2p3d-mtp.yaml, bs8-1p6d-mtp.yaml. MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.

* Add MTP recipe variants for H200 1k1k configurations

  Add speculative decoding (MTP) versions for all H200 1k1k recipes: bs128-agg-tp-mtp.yaml, bs256-1p6d-dep-mtp.yaml, bs256-1p6d-tp-mtp.yaml, low-latency-1p9d-mtp.yaml. MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.

* Reduce batch sizes for MTP configs based on TRTLLM patterns

  Adjust MTP configurations to use smaller batch sizes compared to non-MTP: 1k1k MTP batch size = STP / 4 (512→128, 256→64); 8k1k MTP batch size = STP / 8 (256→32, 128→16, 32→4, 16→2, 8→2). Also reduce mem-fraction-static from 0.82-0.88 to 0.75-0.80 to leave more memory for MTP draft workers and avoid CUDA OOM errors. This matches the pattern used in trtllm/h200 MTP configs where batch sizes are significantly reduced to accommodate speculative decoding.

* Expand concurrency ranges and tune MTP memory settings for 8k1k recipes

  Extend benchmark concurrencies to cover a wider range for better Pareto curves; increase mem-fraction-static, max-running-requests, and cuda-graph-max-bs for MTP configs to improve throughput.

* Revert non mtp

* Update h200 mtp recipes

* Update h200 mtp recipes

---------

Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Summary
This PR adds MTP (Multi-Token Prediction / Speculative Decoding) recipe variants for all H200 configurations and tunes the parameters for optimal performance.
Changes
New MTP Recipes Added:
- recipes/h200/1k1k/: 4 new MTP variants (bs128-agg-tp, bs256-1p6d-dep, bs256-1p6d-tp, low-latency-1p9d)
- recipes/h200/8k1k/: 6 new MTP variants (bs128-1p1d-dep, bs128-agg-tp, bs16-1p3d, bs4-1p7d, bs64-2p3d, bs8-1p6d)

MTP Configuration:

- `SGLANG_ENABLE_SPEC_V2` environment variable
- Speculative decoding params (`speculative-algorithm: EAGLE`, `speculative-num-steps`, etc.)
- Reduced `mem-fraction-static` to leave memory for MTP draft workers

Benchmark Improvements:

- `bs128-1p1d-dep` concurrencies expanded from `64x128x256` to `32x64x128x256x512`

Test Plan
Results
MTP provides significant improvements in user-facing latency across all configurations: