
Add MTP (Multi-Token Prediction) recipe variants for H200 configurations#120

Merged

ishandhanani merged 8 commits into ishandhanani:main from xutizhou:main on Feb 5, 2026
Conversation

@xutizhou (Contributor) commented Jan 30, 2026

Summary

This PR adds MTP (Multi-Token Prediction / Speculative Decoding) recipe variants for all H200 configurations and tunes their parameters (batch sizes, memory fractions, concurrency ranges) for performance.

Changes

New MTP Recipes Added:

  • recipes/h200/1k1k/: 4 new MTP variants (bs128-agg-tp, bs256-1p6d-dep, bs256-1p6d-tp, low-latency-1p9d)
  • recipes/h200/8k1k/: 6 new MTP variants (bs128-1p1d-dep, bs128-agg-tp, bs16-1p3d, bs4-1p7d, bs64-2p3d, bs8-1p6d)

MTP Configuration:

  • Added SGLANG_ENABLE_SPEC_V2 environment variable
  • Added speculative decoding parameters (speculative-algorithm: EAGLE, speculative-num-steps, etc.)
  • Tuned batch sizes: MTP uses smaller batch sizes (1/4 to 1/8 of STP) to accommodate draft workers
  • Adjusted mem-fraction-static to leave memory for MTP draft workers
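
In recipe form, the MTP-specific additions look like the following fragment (values taken from the recipes in this PR; the env-var value and the exact nesting under the decode block are illustrative assumptions, not copied from one specific file):

```yaml
env:
  SGLANG_ENABLE_SPEC_V2: "1"   # assumed value; the PR only names the variable

sglang_config:
  decode:
    # Reduced from 0.82-0.88 to leave headroom for MTP draft workers
    mem-fraction-static: 0.75

    # MTP / speculative decoding settings
    speculative-algorithm: "EAGLE"
    speculative-num-steps: 2
    speculative-eagle-topk: 1
    speculative-num-draft-tokens: 3
```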

Benchmark Improvements:

  • Extended concurrency ranges for better Pareto curve coverage
  • Example: bs128-1p1d-dep concurrencies expanded from 64x128x256 to 32x64x128x256x512
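
The expanded range shows up in the recipe's sa-bench block; a sketch for the bs128-1p1d-dep (8k1k) example, following the benchmark blocks quoted elsewhere in this PR:

```yaml
benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  # previously "64x128x256"
  concurrencies: "32x64x128x256x512"
  req_rate: "inf"
```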

Test Plan

  • MTP recipes tested on H200 cluster
  • Pareto curves generated comparing MTP vs non-MTP performance
  • All configurations show an MTP speedup in per-token user latency (TPOT)

Results

MTP provides significant improvements in user-facing latency across all configurations:

  • agg-tp (1k1k): ~20-25% faster user speed at same throughput
  • bs256-1p6d (1k1k): ~30-45% faster user speed
  • low-latency-1p9d: 10-15% improvement in low-latency scenarios
  • 8k1k configs: Consistent rightward shift in Pareto curves

Summary by CodeRabbit

  • New Features
    • Added many H200 FP8 deployment recipes covering aggregation, disaggregated, and single-node topologies.
    • Introduced multi-stage (prefill/decode) and single-stage variants with varied model-parallel layouts and batch-size options.
    • Built-in MTP speculative decoding (EAGLE) and tuning knobs for memory, CUDA-graph, and streaming behavior.
    • Included benchmarking configurations for throughput/latency profiles and large context support (1k/8k).

Add speculative decoding (MTP) versions for all H200 8k1k recipes:
- bs128-1p1d-dep-mtp.yaml
- bs128-agg-tp-mtp.yaml
- bs16-1p3d-mtp.yaml
- bs4-1p7d-mtp.yaml
- bs64-2p3d-mtp.yaml
- bs8-1p6d-mtp.yaml

MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.

Add speculative decoding (MTP) versions for all H200 1k1k recipes:
- bs128-agg-tp-mtp.yaml
- bs256-1p6d-dep-mtp.yaml
- bs256-1p6d-tp-mtp.yaml
- low-latency-1p9d-mtp.yaml

MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.

Adjust MTP configurations to use smaller batch sizes compared to non-MTP:
- 1k1k: MTP batch size = STP / 4 (512→128, 256→64)
- 8k1k: MTP batch size = STP / 8 (256→32, 128→16, 32→4, 16→2, 8→2)

Also reduce mem-fraction-static from 0.82-0.88 to 0.75-0.80 to leave
more memory for MTP draft workers and avoid CUDA OOM errors.

This matches the pattern used in trtllm/h200 MTP configs where batch
sizes are significantly reduced to accommodate speculative decoding.
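
The scaling above can be sketched as a small helper (hypothetical; the floor of 2 is inferred from the 16→2 and 8→2 mappings rather than stated anywhere in the recipes):

```python
def mtp_batch_size(stp_bs: int, divisor: int, floor: int = 2) -> int:
    """Scale an STP batch size down for an MTP variant.

    1k1k recipes use divisor=4, 8k1k recipes use divisor=8; tiny configs
    keep at least `floor` running requests (assumed, see lead-in).
    """
    return max(stp_bs // divisor, floor)

# 1k1k: 512 -> 128, 256 -> 64
assert mtp_batch_size(512, 4) == 128
assert mtp_batch_size(256, 4) == 64

# 8k1k: 256 -> 32, 128 -> 16, 32 -> 4, 16 -> 2, 8 -> 2
for stp, mtp in [(256, 32), (128, 16), (32, 4), (16, 2), (8, 2)]:
    assert mtp_batch_size(stp, 8) == mtp
```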

- Extend benchmark concurrencies to cover wider range for better Pareto curves
- Increase mem-fraction-static, max-running-requests, and cuda-graph-max-bs
  for MTP configs to improve throughput
@coderabbitai bot (Contributor) commented Jan 30, 2026

Caution: review failed; the pull request is closed.

📝 Walkthrough

Adds 10 new SGLang YAML deployment recipes for H200 GPUs, defining aggregated and disaggregated prefill/decode topologies with FP8 settings, disaggregation parameters, memory/token budgets, CUDA-graph caps, and MTP speculative decoding options across various batch and node configurations.

Changes

  • H200 1k1k Aggregate (recipes/h200/1k1k/bs128-agg-tp-mtp.yaml): New single-aggregate (1 node, 8 GPUs) TP config with FP8, memory/prefill limits, CUDA-graph cap, and MTP speculative settings (EAGLE, steps/topk/draft-tokens).
  • H200 1k1k Disaggregated (recipes/h200/1k1k/bs256-1p6d-dep-mtp.yaml, recipes/h200/1k1k/bs256-1p6d-tp-mtp.yaml, recipes/h200/1k1k/low-latency-1p9d-mtp.yaml): Multi-node prefill/decode recipes with separate env blocks, detailed sglang_config for prefill vs decode, disaggregation (port 30001, nixl), memory/token budgets, load balancing, and MTP options.
  • H200 8k1k Aggregate (recipes/h200/8k1k/bs128-agg-tp-mtp.yaml): New single-aggregate (1 node, 8 GPUs) TP config mirroring the 1k1k aggregate with FP8 and MTP settings.
  • H200 8k1k Disaggregated (recipes/h200/8k1k/bs128-1p1d-dep-mtp.yaml, recipes/h200/8k1k/bs16-1p3d-mtp.yaml, recipes/h200/8k1k/bs4-1p7d-mtp.yaml, recipes/h200/8k1k/bs8-1p6d-mtp.yaml, recipes/h200/8k1k/bs64-2p3d-mtp.yaml): Multiple prefill/decode recipes (1p1d through 2p3d) with per-phase envs, disaggregation bootstrap/transfer, per-phase parallelism (tp/dp/ep), mem-fraction/static token limits, stream intervals, CUDA-graph caps, and consistent MTP speculative config (EAGLE, steps=2, topk=1, draft-tokens=3).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hop through YAML fields at night,
adding nodes and MTP delight,
ports and tokens in tidy rows,
speculative drafts where logic grows,
H200 dreams in FP8 light. ✨

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title accurately summarizes the main change (adding MTP recipe variants for H200 configurations) and aligns with the new YAML files in the changeset.
  • Docstring Coverage ✅ Passed: no functions found in the changed files to evaluate; check skipped.

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@recipes/h200/1k1k/bs256-1p6d-tp-mtp.yaml`:
- Around line 101-107: The benchmark concurrencies ("concurrencies" field set to
"512x1024x2048") exceed the runtime decode caps; either increase the decode
limits by raising max-running-requests and cuda-graph-max-bs to at least 2048
(so they match the highest concurrency) or revert the "concurrencies" value back
to a supported set like "128x256x512"; update the fields named
max-running-requests and cuda-graph-max-bs (or the decode configuration block)
to the new numeric limits if you choose to raise limits, or change the
benchmark.concurrencies string to the lower values if you choose to align with
current caps.

In `@recipes/h200/8k1k/bs8-1p6d-mtp.yaml`:
- Around line 101-106: The benchmark concurrency list in benchmark.concurrencies
includes "32" which exceeds the decode server caps set by max-running-requests
and cuda-graph-max-bs (both currently 16); either remove "32" from the
benchmark.concurrencies string (e.g., change "2x4x8x16x32" to "2x4x8x16") or
raise both decode config keys max-running-requests and cuda-graph-max-bs to 32
so the decode server can actually run 32 concurrent requests; update the
corresponding values for the decode service (max-running-requests and
cuda-graph-max-bs) or adjust benchmark.concurrencies accordingly to keep them
consistent.
🧹 Nitpick comments (6)
recipes/h200/8k1k/bs64-2p3d-mtp.yaml (2)

36-40: Minor: Trailing whitespace on line 39.

There's a trailing space after ep-size: 1. Consider removing for consistency.

🧹 Proposed fix
       # Parallelism
       tp-size: 8
       dp-size: 1
-      ep-size: 1 
+      ep-size: 1

89-90: Minor: Trailing whitespace on line 90.

There's a trailing space after max-total-tokens: 128000.

🧹 Proposed fix
       context-length: 72000
-      max-total-tokens: 128000 
+      max-total-tokens: 128000
recipes/h200/8k1k/bs16-1p3d.yaml (1)

96-96: Align benchmark concurrencies with server caps if you want true 64‑way runs.

max-running-requests and cuda-graph-max-bs are 32, so a 64 concurrency run will queue/cap rather than execute 64 active requests. If the goal is true 64‑way runs, consider raising those caps; otherwise, drop 64 from the benchmark list to keep the curve interpretable.

🔧 Optional tweak to keep concurrencies within current caps
-  concurrencies: "4x8x16x32x64"
+  concurrencies: "4x8x16x32"
recipes/h200/8k1k/bs16-1p3d-mtp.yaml (2)

65-93: Verify MTP memory headroom with mem-fraction-static: 0.82.

MTP draft workers can reduce usable memory headroom. If you see OOMs or paging during runs, consider lowering the decode mem-fraction-static to reserve more space.

🔧 Example adjustment if headroom proves tight
-      mem-fraction-static: 0.82
+      mem-fraction-static: 0.78

100-105: Benchmark concurrencies include 64; ensure server caps allow 64 active requests.

With max-running-requests/cuda-graph-max-bs at 32, 64 concurrency will queue/cap rather than run 64 active requests. Either raise caps or keep concurrencies ≤ 32 to avoid skewed Pareto points.

🔧 Optional tweak to keep concurrencies within current caps
-  concurrencies: "4x8x16x32x64"
+  concurrencies: "4x8x16x32"
recipes/h200/8k1k/bs128-agg-tp-mtp.yaml (1)

1-1: Consider: Filename indicates bs128 but max-running-requests is 32.

The filename bs128-agg-tp-mtp.yaml suggests batch size 128, but max-running-requests and cuda-graph-max-bs are both set to 32 (consistent with "MTP = STP / 4" per PR objectives). This naming convention ties MTP variants to their STP counterparts, which aids comparison.

If this naming is intentional for traceability to the STP baseline, no change needed. Otherwise, consider renaming to reflect actual batch size for clarity (e.g., bs32-agg-tp-mtp.yaml).

Comment on lines +101 to +107
benchmark:
  type: "sa-bench"
  isl: 1024
  osl: 1024
  # concurrencies: "128x256x512"
  concurrencies: "512x1024x2048"
  req_rate: "inf"
⚠️ Potential issue | 🟡 Minor


Benchmark concurrencies exceed max-running-requests (128).
Line 106 targets 512/1024/2048 while the decode configuration caps at 128 (line 92–93), causing the server to queue requests and preventing the benchmark from fully testing the intended concurrency levels. Raise max-running-requests/cuda-graph-max-bs to support higher concurrencies, or revert to the 128/256/512 set to match current limits.

🔧 Option: align concurrencies with current runtime caps
-  # concurrencies: "128x256x512"
-  concurrencies: "512x1024x2048"
+  concurrencies: "128x256x512"
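
Both cap/concurrency mismatches flagged in this review can be caught mechanically. A hypothetical lint over the "AxBxC"-style concurrencies string (function name and structure are illustrative, not part of the repo):

```python
def over_cap(concurrencies: str, max_running_requests: int,
             cuda_graph_max_bs: int) -> list[int]:
    """Return the benchmark concurrency levels exceeding either decode cap."""
    cap = min(max_running_requests, cuda_graph_max_bs)
    return [c for c in map(int, concurrencies.split("x")) if c > cap]

# bs8-1p6d-mtp.yaml: decode caps are 16, so the 32-way point cannot truly run
print(over_cap("2x4x8x16x32", 16, 16))   # [32]
```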

Comment on lines +101 to +106
benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "2x4x8x16x32"
  req_rate: "inf"

⚠️ Potential issue | 🟡 Minor



Benchmark concurrency 32 exceeds decode server limits.
Line 105 sets concurrency to test up to 32, but line 92–93 configure decode with max-running-requests: 16 and cuda-graph-max-bs: 16. The benchmark will be throttled to 16 concurrent requests when testing concurrency=32. Either raise these caps to 32 or remove 32 from the concurrency list.


@nvpohanh commented Feb 2, 2026

@ishandhanani could you review and merge? thanks

@coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@recipes/h200/8k1k/bs128-1p1d-dep-mtp.yaml`:
- Around line 103-106: The EAGLE speculative config is missing the required
speculative-draft-model-path parameter; add a new key named
speculative-draft-model-path under the same block where speculative-algorithm:
"EAGLE" is set and assign it a valid EAGLE draft model identifier (e.g.,
"jamesliu/sglang-EAGLE3-Llama-...") so that speculative decoding parameters
(speculative-num-steps, speculative-eagle-topk, speculative-num-draft-tokens)
can function.

In `@recipes/h200/8k1k/bs128-agg-tp-mtp.yaml`:
- Around line 27-31: The config sets trust-remote-code: true which allows
executing arbitrary remote model code; confirm whether this deployment trusts
the remote repo or change to trust-remote-code: false and instead pin the model
to a specific commit/sha (or add a checksum) to avoid remote code execution;
update the recipe entry for served-model-name / model-path to either set
trust-remote-code: false or add a clear comment and a pinned ref (e.g., commit
tag or digest) and include a short justification in the PR so reviewers know
this was an intentional security decision.
- Around line 44-65: The max-running-requests value (max-running-requests: 32)
limits throughput below the highest benchmark points in benchmark.concurrencies
("1x4x16x32x64x128x256"); update max-running-requests to at least the highest
concurrency (e.g., 256) so it doesn't clip results, or alternatively trim
benchmark.concurrencies to remove values above 32; locate the keys
max-running-requests and benchmark.concurrencies in the YAML (symbols:
max-running-requests and benchmark.concurrencies) and make the corresponding
change so the cap and concurrency list are aligned.

In `@recipes/h200/8k1k/bs4-1p7d-mtp.yaml`:
- Line 45: Remove the trailing whitespace after the value for the ep-size
setting so the line reads "ep-size: 1" with no extra space; locate the line
containing the key "ep-size" in the recipe (ep-size: 1) and trim the trailing
space to match file formatting consistency.
🧹 Nitpick comments (1)
recipes/h200/1k1k/low-latency-1p9d-mtp.yaml (1)

43-45: Minor formatting: trailing whitespace and extra blank line.

Line 45 has trailing whitespace after ep-size: 1, and there's an extra blank line at 57. These don't affect functionality but could be cleaned up for consistency.

Suggested diff
       # Parallelism
       tp-size: 8
       dp-size: 1
-      ep-size: 1 
+      ep-size: 1

       # KV cache and attention
       attention-backend: "flashinfer"

       # Radix cache disabled
       disable-radix-cache: true

       # Other flags
       # stream-interval: 50
       max-running-requests: 256
-      

Also applies to: 56-57

Comment on lines +103 to +106
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3

⚠️ Potential issue | 🔴 Critical



Add required speculative-draft-model-path parameter for EAGLE configuration.

SGLang v0.5.8 supports EAGLE speculative decoding and the specified parameters (speculative-num-steps, speculative-eagle-topk, speculative-num-draft-tokens). However, the configuration is missing the required speculative-draft-model-path parameter, which must point to a valid EAGLE draft model (e.g., jamesliu/sglang-EAGLE3-Llama-...). Without it, the EAGLE speculative decoding will not function.
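
A corrected fragment would add the draft-model path alongside the existing keys (the path value below is a placeholder, not a real recommendation for DeepSeek-R1; the recipe authors must pick an actual EAGLE draft model):

```yaml
speculative-algorithm: "EAGLE"
speculative-draft-model-path: "<org>/<eagle-draft-model>"  # placeholder; required by SGLang
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3
```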


Comment on lines +27 to +31
served-model-name: "deepseek-ai/DeepSeek-R1"
model-path: "/model/"
skip-tokenizer-init: true
trust-remote-code: true


⚠️ Potential issue | 🟠 Major

Confirm trust-remote-code: true is acceptable for this deployment.

Line 30 enables execution of remote model code, which is a meaningful security posture risk if the model repo isn’t tightly controlled or pinned.
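
One hedged way to tighten this, assuming the weights are already staged locally at /model/ as the recipe's model-path suggests (whether DeepSeek-R1 loads with remote code disabled depends on the transformers/SGLang versions in use):

```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
model-path: "/model/"        # weights staged locally, nothing fetched at runtime
skip-tokenizer-init: true
trust-remote-code: false     # opt out of executing repo-provided model code
```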


Comment on lines +44 to +65
max-running-requests: 32 # sum of all dp

# Memory and token limits
mem-fraction-static: 0.75
max-prefill-tokens: 32768
chunked-prefill-size: 32768

# CUDA graphs
cuda-graph-max-bs: 32

# MTP settings
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "1x4x16x32x64x128x256"
  req_rate: "inf"

⚠️ Potential issue | 🟠 Major

Align max-running-requests with benchmark concurrencies.

Line 44 caps running requests at 32, but Line 64 includes concurrencies up to 256. That will cap/flatten higher points and distort the Pareto curve. Either raise the limit or trim the concurrency list to ≤32.

🔧 Suggested adjustment (trim concurrencies)
-  concurrencies: "1x4x16x32x64x128x256"
+  concurrencies: "1x4x16x32"

# Parallelism
tp-size: 8
dp-size: 1
ep-size: 1

⚠️ Potential issue | 🟡 Minor

Minor: Trailing whitespace.

Line 45 has a trailing space after the value. While this shouldn't affect parsing, it's inconsistent with the rest of the file.

Suggested fix
-      ep-size: 1 
+      ep-size: 1

@ishandhanani merged commit db11108 into ishandhanani:main on Feb 5, 2026 (1 check was pending).
ishandhanani added a commit that referenced this pull request on Feb 5, 2026:

Add MTP (Multi-Token Prediction) recipe variants for H200 configurations (#120)

* Add MTP recipe variants for H200 8k1k configurations
* Add MTP recipe variants for H200 1k1k configurations
* Reduce batch sizes for MTP configs based on TRTLLM patterns
* Expand concurrency ranges and tune MTP memory settings for 8k1k recipes
* Revert non mtp
* Update h200 mtp recipes
* Update h200 mtp recipes

Co-authored-by: ishandhanani <ishandhanani@gmail.com>
@coderabbitai bot mentioned this pull request on Apr 3, 2026.