Add MTP (Multi-Token Prediction) recipe variants for H200 configurations #120

ishandhanani merged 8 commits into ishandhanani:main from
Conversation
Add speculative decoding (MTP) versions for all H200 8k1k recipes:

- bs128-1p1d-dep-mtp.yaml
- bs128-agg-tp-mtp.yaml
- bs16-1p3d-mtp.yaml
- bs4-1p7d-mtp.yaml
- bs64-2p3d-mtp.yaml
- bs8-1p6d-mtp.yaml

MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.
Add speculative decoding (MTP) versions for all H200 1k1k recipes:

- bs128-agg-tp-mtp.yaml
- bs256-1p6d-dep-mtp.yaml
- bs256-1p6d-tp-mtp.yaml
- low-latency-1p9d-mtp.yaml

MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.
Adjust MTP configurations to use smaller batch sizes compared to non-MTP:

- 1k1k: MTP batch size = STP / 4 (512→128, 256→64)
- 8k1k: MTP batch size = STP / 8 (256→32, 128→16, 32→4, 16→2, 8→2)

Also reduce mem-fraction-static from 0.82-0.88 to 0.75-0.80 to leave more memory for MTP draft workers and avoid CUDA OOM errors. This matches the pattern used in trtllm/h200 MTP configs, where batch sizes are significantly reduced to accommodate speculative decoding.
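Taken together, the commit messages above describe an overlay on top of each STP baseline. A minimal sketch of that overlay, using only key names and values quoted in this PR (the env var's value and the `env:` nesting are assumptions; the PR names the variable but not its value or placement):

```yaml
# Illustrative MTP overlay on an 8k1k STP recipe (values from this PR's description)
env:
  SGLANG_ENABLE_SPEC_V2: "1"   # assumed value; the PR only names the variable

# Batch size reduced to STP / 8 for 8k1k (e.g. 256 -> 32)
max-running-requests: 32
cuda-graph-max-bs: 32
mem-fraction-static: 0.75      # down from 0.82-0.88 to leave room for draft workers

# Speculative decoding (MTP) params
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3
```

Whether `env:` is the correct nesting depends on the recipe schema; the point of the sketch is which knobs move and in which direction.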
- Extend benchmark concurrencies to cover a wider range for better Pareto curves
- Increase mem-fraction-static, max-running-requests, and cuda-graph-max-bs for MTP configs to improve throughput
Caution: Review failed. The pull request is closed.

📝 Walkthrough

Adds 10 new SGLang YAML deployment recipes for H200 GPUs, defining aggregated and disaggregated prefill/decode topologies with FP8 settings, disaggregation parameters, memory/token budgets, CUDA-graph caps, and MTP speculative decoding options across various batch and node configurations.
🚥 Pre-merge checks: 3 passed
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipes/h200/1k1k/bs256-1p6d-tp-mtp.yaml`:
- Around line 101-107: The benchmark concurrencies ("concurrencies" field set to
"512x1024x2048") exceed the runtime decode caps; either increase the decode
limits by raising max-running-requests and cuda-graph-max-bs to at least 2048
(so they match the highest concurrency) or revert the "concurrencies" value back
to a supported set like "128x256x512"; update the fields named
max-running-requests and cuda-graph-max-bs (or the decode configuration block)
to the new numeric limits if you choose to raise limits, or change the
benchmark.concurrencies string to the lower values if you choose to align with
current caps.
In `@recipes/h200/8k1k/bs8-1p6d-mtp.yaml`:
- Around line 101-106: The benchmark concurrency list in benchmark.concurrencies
includes "32" which exceeds the decode server caps set by max-running-requests
and cuda-graph-max-bs (both currently 16); either remove "32" from the
benchmark.concurrencies string (e.g., change "2x4x8x16x32" to "2x4x8x16") or
raise both decode config keys max-running-requests and cuda-graph-max-bs to 32
so the decode server can actually run 32 concurrent requests; update the
corresponding values for the decode service (max-running-requests and
cuda-graph-max-bs) or adjust benchmark.concurrencies accordingly to keep them
consistent.
🧹 Nitpick comments (6)
recipes/h200/8k1k/bs64-2p3d-mtp.yaml (2)
36-40: Minor: Trailing whitespace on line 39. There's a trailing space after `ep-size: 1`. Consider removing for consistency.

🧹 Proposed fix

```diff
  # Parallelism
  tp-size: 8
  dp-size: 1
- ep-size: 1 
+ ep-size: 1
```
89-90: Minor: Trailing whitespace on line 90. There's a trailing space after `max-total-tokens: 128000`.

🧹 Proposed fix

```diff
  context-length: 72000
- max-total-tokens: 128000 
+ max-total-tokens: 128000
```

recipes/h200/8k1k/bs16-1p3d.yaml (1)
96-96: Align benchmark concurrencies with server caps if you want true 64‑way runs.
`max-running-requests` and `cuda-graph-max-bs` are 32, so a 64-way concurrency run will queue/cap rather than execute 64 active requests. If the goal is true 64‑way runs, consider raising those caps; otherwise, drop 64 from the benchmark list to keep the curve interpretable.

🔧 Optional tweak to keep concurrencies within current caps

```diff
- concurrencies: "4x8x16x32x64"
+ concurrencies: "4x8x16x32"
```

recipes/h200/8k1k/bs16-1p3d-mtp.yaml (2)
65-93: Verify MTP memory headroom with `mem-fraction-static: 0.82`. MTP draft workers can reduce usable memory headroom. If you see OOMs or paging during runs, consider lowering the decode `mem-fraction-static` to reserve more space.

🔧 Example adjustment if headroom proves tight

```diff
- mem-fraction-static: 0.82
+ mem-fraction-static: 0.78
```
100-105: Benchmark concurrencies include 64; ensure server caps allow 64 active requests. With `max-running-requests`/`cuda-graph-max-bs` at 32, 64 concurrency will queue/cap rather than run 64 active requests. Either raise caps or keep concurrencies ≤ 32 to avoid skewed Pareto points.

🔧 Optional tweak to keep concurrencies within current caps

```diff
- concurrencies: "4x8x16x32x64"
+ concurrencies: "4x8x16x32"
```

recipes/h200/8k1k/bs128-agg-tp-mtp.yaml (1)
1-1: Consider: Filename indicates bs128 but `max-running-requests` is 32. The filename `bs128-agg-tp-mtp.yaml` suggests batch size 128, but `max-running-requests` and `cuda-graph-max-bs` are both set to 32 (consistent with "MTP = STP / 4" per PR objectives). This naming convention ties MTP variants to their STP counterparts, which aids comparison. If this naming is intentional for traceability to the STP baseline, no change needed. Otherwise, consider renaming to reflect actual batch size for clarity (e.g., `bs32-agg-tp-mtp.yaml`).
```yaml
benchmark:
  type: "sa-bench"
  isl: 1024
  osl: 1024
  # concurrencies: "128x256x512"
  concurrencies: "512x1024x2048"
  req_rate: "inf"
```
Benchmark concurrencies exceed max-running-requests (128).
Line 106 targets 512/1024/2048 while the decode configuration caps at 128 (line 92–93), causing the server to queue requests and preventing the benchmark from fully testing the intended concurrency levels. Raise max-running-requests/cuda-graph-max-bs to support higher concurrencies, or revert to the 128/256/512 set to match current limits.
🔧 Option: align concurrencies with current runtime caps

```diff
- # concurrencies: "128x256x512"
- concurrencies: "512x1024x2048"
+ concurrencies: "128x256x512"
```

🤖 Prompt for AI Agents
In `@recipes/h200/1k1k/bs256-1p6d-tp-mtp.yaml` around lines 101 - 107, The
benchmark concurrencies ("concurrencies" field set to "512x1024x2048") exceed
the runtime decode caps; either increase the decode limits by raising
max-running-requests and cuda-graph-max-bs to at least 2048 (so they match the
highest concurrency) or revert the "concurrencies" value back to a supported set
like "128x256x512"; update the fields named max-running-requests and
cuda-graph-max-bs (or the decode configuration block) to the new numeric limits
if you choose to raise limits, or change the benchmark.concurrencies string to
the lower values if you choose to align with current caps.
```yaml
benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "2x4x8x16x32"
  req_rate: "inf"
```
Benchmark concurrency 32 exceeds decode server limits.
Line 105 sets concurrency to test up to 32, but line 92–93 configure decode with max-running-requests: 16 and cuda-graph-max-bs: 16. The benchmark will be throttled to 16 concurrent requests when testing concurrency=32. Either raise these caps to 32 or remove 32 from the concurrency list.
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs8-1p6d-mtp.yaml` around lines 101 - 106, The benchmark
concurrency list in benchmark.concurrencies includes "32" which exceeds the
decode server caps set by max-running-requests and cuda-graph-max-bs (both
currently 16); either remove "32" from the benchmark.concurrencies string (e.g.,
change "2x4x8x16x32" to "2x4x8x16") or raise both decode config keys
max-running-requests and cuda-graph-max-bs to 32 so the decode server can
actually run 32 concurrent requests; update the corresponding values for the
decode service (max-running-requests and cuda-graph-max-bs) or adjust
benchmark.concurrencies accordingly to keep them consistent.
@ishandhanani could you review and merge? thanks
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@recipes/h200/8k1k/bs128-1p1d-dep-mtp.yaml`:
- Around line 103-106: The EAGLE speculative config is missing the required
speculative-draft-model-path parameter; add a new key named
speculative-draft-model-path under the same block where speculative-algorithm:
"EAGLE" is set and assign it a valid EAGLE draft model identifier (e.g.,
"jamesliu/sglang-EAGLE3-Llama-...") so that speculative decoding parameters
(speculative-num-steps, speculative-eagle-topk, speculative-num-draft-tokens)
can function.
In `@recipes/h200/8k1k/bs128-agg-tp-mtp.yaml`:
- Around line 27-31: The config sets trust-remote-code: true which allows
executing arbitrary remote model code; confirm whether this deployment trusts
the remote repo or change to trust-remote-code: false and instead pin the model
to a specific commit/sha (or add a checksum) to avoid remote code execution;
update the recipe entry for served-model-name / model-path to either set
trust-remote-code: false or add a clear comment and a pinned ref (e.g., commit
tag or digest) and include a short justification in the PR so reviewers know
this was an intentional security decision.
- Around line 44-65: The max-running-requests value (max-running-requests: 32)
limits throughput below the highest benchmark points in benchmark.concurrencies
("1x4x16x32x64x128x256"); update max-running-requests to at least the highest
concurrency (e.g., 256) so it doesn't clip results, or alternatively trim
benchmark.concurrencies to remove values above 32; locate the keys
max-running-requests and benchmark.concurrencies in the YAML (symbols:
max-running-requests and benchmark.concurrencies) and make the corresponding
change so the cap and concurrency list are aligned.
In `@recipes/h200/8k1k/bs4-1p7d-mtp.yaml`:
- Line 45: Remove the trailing whitespace after the value for the ep-size
setting so the line reads "ep-size: 1" with no extra space; locate the line
containing the key "ep-size" in the recipe (ep-size: 1) and trim the trailing
space to match file formatting consistency.
🧹 Nitpick comments (1)
recipes/h200/1k1k/low-latency-1p9d-mtp.yaml (1)
43-45: Minor formatting: trailing whitespace and extra blank line. Line 45 has trailing whitespace after `ep-size: 1`, and there's an extra blank line at 57. These don't affect functionality but could be cleaned up for consistency.

Suggested diff

```diff
  # Parallelism
  tp-size: 8
  dp-size: 1
- ep-size: 1 
+ ep-size: 1
  # KV cache and attention
  attention-backend: "flashinfer"
  # Radix cache disabled
  disable-radix-cache: true
  # Other flags
  # stream-interval: 50
  max-running-requests: 256
-
```

Also applies to: 56-57
```yaml
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3
```
🌐 Web query: SGLang v0.5.8 EAGLE speculative decoding support

💡 Result:
Yes—SGLang v0.5.8 (PyPI upload: 2026‑01‑23) includes EAGLE-based speculative decoding support (both EAGLE‑2 and EAGLE‑3) and documents how to run it via server flags. [1][2]
How to enable (server CLI)

- EAGLE‑2: set `--speculative-algorithm EAGLE`
- EAGLE‑3: set `--speculative-algorithm EAGLE3`
- Provide a draft model via the required `--speculative-draft-model-path ...`

Key tuning flags (apply to EAGLE‑2/‑3): `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens`. [2]
Example (EAGLE‑3):

```shell
python3 -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 32
```

[2]
EAGLE‑2 “FR-Spec” token-map optimization
For EAGLE‑2, SGLang also documents an additional optimization enabled with `--speculative-token-map ...` [2]
Notes / compatibility
SGLang’s docs note speculative decoding is compatible with radix cache and chunked prefill. [3]
Sources: PyPI release page for v0.5.8 [1]; SGLang speculative decoding docs [2][3].
Add required speculative-draft-model-path parameter for EAGLE configuration.
SGLang v0.5.8 supports EAGLE speculative decoding and the specified parameters (speculative-num-steps, speculative-eagle-topk, speculative-num-draft-tokens). However, the configuration is missing the required speculative-draft-model-path parameter, which must point to a valid EAGLE draft model (e.g., jamesliu/sglang-EAGLE3-Llama-...). Without it, the EAGLE speculative decoding will not function.
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs128-1p1d-dep-mtp.yaml` around lines 103 - 106, The EAGLE
speculative config is missing the required speculative-draft-model-path
parameter; add a new key named speculative-draft-model-path under the same block
where speculative-algorithm: "EAGLE" is set and assign it a valid EAGLE draft
model identifier (e.g., "jamesliu/sglang-EAGLE3-Llama-...") so that speculative
decoding parameters (speculative-num-steps, speculative-eagle-topk,
speculative-num-draft-tokens) can function.
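A sketch of the corrected block; the draft model path value is a placeholder, not a verified model id, since the right value depends on which EAGLE draft weights this deployment uses:

```yaml
speculative-algorithm: "EAGLE"
speculative-draft-model-path: "<org>/<eagle-draft-model>"  # placeholder; must point to real draft weights
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3
```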
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
model-path: "/model/"
skip-tokenizer-init: true
trust-remote-code: true
```
Confirm trust-remote-code: true is acceptable for this deployment.
Line 30 enables execution of remote model code, which is a meaningful security posture risk if the model repo isn’t tightly controlled or pinned.
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs128-agg-tp-mtp.yaml` around lines 27 - 31, The config
sets trust-remote-code: true which allows executing arbitrary remote model code;
confirm whether this deployment trusts the remote repo or change to
trust-remote-code: false and instead pin the model to a specific commit/sha (or
add a checksum) to avoid remote code execution; update the recipe entry for
served-model-name / model-path to either set trust-remote-code: false or add a
clear comment and a pinned ref (e.g., commit tag or digest) and include a short
justification in the PR so reviewers know this was an intentional security
decision.
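If the team decides to keep the flag, the comment-and-pin approach the prompt describes could look like this (the pin comment and justification text are illustrative; the recipe schema may not have a dedicated revision key):

```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
# Local snapshot; record the upstream commit it was downloaded from, e.g.
# huggingface.co/deepseek-ai/DeepSeek-R1 @ <commit-sha>
model-path: "/model/"
skip-tokenizer-init: true
# Intentional: the model ships custom modeling code; the snapshot above is vetted and pinned.
trust-remote-code: true
```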
```yaml
max-running-requests: 32  # sum of all dp

# Memory and token limits
mem-fraction-static: 0.75
max-prefill-tokens: 32768
chunked-prefill-size: 32768

# CUDA graphs
cuda-graph-max-bs: 32

# MTP settings
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "1x4x16x32x64x128x256"
  req_rate: "inf"
```
Align max-running-requests with benchmark concurrencies.
Line 44 caps running requests at 32, but Line 64 includes concurrencies up to 256. That will cap/flatten higher points and distort the Pareto curve. Either raise the limit or trim the concurrency list to ≤32.
🔧 Suggested adjustment (trim concurrencies)

```diff
- concurrencies: "1x4x16x32x64x128x256"
+ concurrencies: "1x4x16x32"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
max-running-requests: 32  # sum of all dp

# Memory and token limits
mem-fraction-static: 0.75
max-prefill-tokens: 32768
chunked-prefill-size: 32768

# CUDA graphs
cuda-graph-max-bs: 32

# MTP settings
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3

benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
  concurrencies: "1x4x16x32"
  req_rate: "inf"
```
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs128-agg-tp-mtp.yaml` around lines 44 - 65, The
max-running-requests value (max-running-requests: 32) limits throughput below
the highest benchmark points in benchmark.concurrencies
("1x4x16x32x64x128x256"); update max-running-requests to at least the highest
concurrency (e.g., 256) so it doesn't clip results, or alternatively trim
benchmark.concurrencies to remove values above 32; locate the keys
max-running-requests and benchmark.concurrencies in the YAML (symbols:
max-running-requests and benchmark.concurrencies) and make the corresponding
change so the cap and concurrency list are aligned.
```yaml
# Parallelism
tp-size: 8
dp-size: 1
ep-size: 1 
```
Minor: Trailing whitespace.
Line 45 has a trailing space after the value. While this shouldn't affect parsing, it's inconsistent with the rest of the file.
Suggested fix

```diff
- ep-size: 1 
+ ep-size: 1
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
ep-size: 1
```
🤖 Prompt for AI Agents
In `@recipes/h200/8k1k/bs4-1p7d-mtp.yaml` at line 45, Remove the trailing
whitespace after the value for the ep-size setting so the line reads "ep-size:
1" with no extra space; locate the line containing the key "ep-size" in the
recipe (ep-size: 1) and trim the trailing space to match file formatting
consistency.
…ons (#120)

* Add MTP recipe variants for H200 8k1k configurations

  Add speculative decoding (MTP) versions for all H200 8k1k recipes: bs128-1p1d-dep-mtp.yaml, bs128-agg-tp-mtp.yaml, bs16-1p3d-mtp.yaml, bs4-1p7d-mtp.yaml, bs64-2p3d-mtp.yaml, bs8-1p6d-mtp.yaml. MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.

* Add MTP recipe variants for H200 1k1k configurations

  Add speculative decoding (MTP) versions for all H200 1k1k recipes: bs128-agg-tp-mtp.yaml, bs256-1p6d-dep-mtp.yaml, bs256-1p6d-tp-mtp.yaml, low-latency-1p9d-mtp.yaml. MTP config adds SGLANG_ENABLE_SPEC_V2 env var and speculative-* params.

* Reduce batch sizes for MTP configs based on TRTLLM patterns

  Adjust MTP configurations to use smaller batch sizes compared to non-MTP: 1k1k MTP batch size = STP / 4 (512→128, 256→64); 8k1k MTP batch size = STP / 8 (256→32, 128→16, 32→4, 16→2, 8→2). Also reduce mem-fraction-static from 0.82-0.88 to 0.75-0.80 to leave more memory for MTP draft workers and avoid CUDA OOM errors. This matches the pattern used in trtllm/h200 MTP configs where batch sizes are significantly reduced to accommodate speculative decoding.

* Expand concurrency ranges and tune MTP memory settings for 8k1k recipes

  Extend benchmark concurrencies to cover a wider range for better Pareto curves; increase mem-fraction-static, max-running-requests, and cuda-graph-max-bs for MTP configs to improve throughput.

* Revert non mtp

* Update h200 mtp recipes

* Update h200 mtp recipes

---------

Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Summary
This PR adds MTP (Multi-Token Prediction / Speculative Decoding) recipe variants for all H200 configurations and tunes the parameters for optimal performance.
Changes
New MTP Recipes Added:
- recipes/h200/1k1k/: 4 new MTP variants (bs128-agg-tp, bs256-1p6d-dep, bs256-1p6d-tp, low-latency-1p9d)
- recipes/h200/8k1k/: 6 new MTP variants (bs128-1p1d-dep, bs128-agg-tp, bs16-1p3d, bs4-1p7d, bs64-2p3d, bs8-1p6d)

MTP Configuration:

- `SGLANG_ENABLE_SPEC_V2` environment variable
- Speculative decoding params (`speculative-algorithm: EAGLE`, `speculative-num-steps`, etc.)
- Reduced `mem-fraction-static` to leave memory for MTP draft workers

Benchmark Improvements:

- `bs128-1p1d-dep` concurrencies expanded from `64x128x256` to `32x64x128x256x512`

Test Plan
Results
MTP provides significant improvements in user-facing latency across all configurations: