Add H100 FP8 SGLang Disaggregated Recipes for DSR1 by liz-badada · Pull Request #121 · ishandhanani/srt-slurm

liz-badada · 2026-01-30T12:39:32Z

Add H100 FP8 SGLang Disaggregated Recipes for DeepSeek-R1

Recipes Added

1k1k

Config	Mode	P:D Ratio	Total Nodes	Total GPUs	Prefill	Decode
h100-fp8-1p1d-max-dep-mtp	MTP (EAGLE)	1P:1D	4	32	TP-only (tp=16)	DEP (tp=dp=ep=16)
h100-fp8-1p2d-max-tp-mtp	MTP (EAGLE)	1P:2D	6	48	TP-only (tp=16)	TP-only (tp=16)
h100-fp8-1p1d-max-dep	STP	1P:1D	4	32	TP-only (tp=16)	DEP (tp=dp=ep=16)
h100-fp8-1p2d-max-tp	STP	1P:2D	6	48	TP-only (tp=16)	TP-only (tp=16)

8k1k

Config	Mode	P:D Ratio	Total Nodes	Total GPUs	Prefill	Decode
h100-fp8-1p1d-max-dep-mtp	MTP (EAGLE)	1P:1D	4	32	TP-only (tp=16)	DEP (tp=dp=ep=16)
h100-fp8-1p1d-max-tp-mtp	MTP (EAGLE)	1P:1D	4	32	TP-only (tp=16)	TP-only (tp=16)
h100-fp8-1p1d-max-dep	STP	1P:1D	4	32	TP-only (tp=16)	DEP (tp=dp=ep=16)
h100-fp8-1p1d-max-tp	STP	1P:1D	4	32	TP-only (tp=16)	TP-only (tp=16)

1k8k

Config	Mode	P:D Ratio	Total Nodes	Total GPUs	Prefill	Decode
h100-fp8-1p1d-max-dep-mtp	MTP (EAGLE)	1P:1D	4	32	TP-only (tp=16)	DEP (tp=dp=ep=16)
h100-fp8-1p2d-max-tp-mtp	MTP (EAGLE)	1P:2D	6	48	TP-only (tp=16)	TP-only (tp=16)
h100-fp8-1p1d-max-dep	STP	1P:1D	4	32	TP-only (tp=16)	DEP (tp=dp=ep=16)
h100-fp8-1p2d-max-tp	STP	1P:2D	6	48	TP-only (tp=16)	TP-only (tp=16)

Key Settings

Container: lmsysorg/sglang:v0.5.8-cu130-runtime
Model: DeepSeek-R1 FP8
Backend: FlashInfer attention
Transfer: NIXL disaggregation
Container: lmsysorg/sglang:v0.5.8-cu130-runtime
Model: DeepSeek-R1 FP8
Backend: FlashInfer attention
Transfer: NIXL disaggregation
MTP (EAGLE): speculative-num-draft-tokens: 3, speculative-num-steps: 2
STP: Single Token Prediction (no speculative decoding)
DEP: Data+Expert Parallelism，tp=16, dp=16, ep=16, enable-dp-attention=true
TP-only: Tensor Parallelism，tp=16, dp=1, ep=1, enable-dp-attention=false
1P:1D: 1 prefill worker (2 nodes) + 1 decode worker (2 nodes)
1P:2D: 1 prefill worker (2 nodes) + 2 decode workers (4 nodes)

Summary by CodeRabbit

Chores
- Added multiple new H100 FP8 deployment recipes covering various sequence-length, node/worker and GPU configurations with per-phase prefill/decode resource, memory and disaggregation controls, cache/workspace flags, and benchmark presets.
New Features
- Introduced multi-token prediction (MTP) and speculative decoding (EAGLE) parameters for tuning throughput (speculative steps/top-k/draft tokens) and improved decode/prefill orchestration.

coderabbitai · 2026-01-30T12:39:41Z

📝 Walkthrough

Walkthrough

Adds 12 new H100 FP8 YAML deployment recipes covering 1k1k, 1k8k, and 8k1k input/output variants. Each file defines model metadata, per-phase resources and environment, separate prefill/decode sglang_config (parallelism, disaggregation, cache), optional MTP/EAGLE parameters, and benchmark settings.

Changes

Cohort / File(s)	Summary
1k1k configs `recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml`, `recipes/h100/1k1k/stp/h100-fp8-1p2d-max-tp.yaml`, `recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`, `recipes/h100/1k1k/mtp/h100-fp8-1p2d-max-tp-mtp.yaml`	New deployment recipes for 1k/1k workloads: model/container/precision, per-phase env, prefill/decode sglang_config (tp/dp/ep, attention backend, radix-cache, disaggregation, mem/token limits), MTP where present, and benchmarks.
1k8k configs `recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml`, `recipes/h100/1k8k/stp/h100-fp8-1p2d-max-tp.yaml`, `recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`, `recipes/h100/1k8k/mtp/h100-fp8-1p2d-max-tp-mtp.yaml`	New deployment recipes tuned for 1k input / 8k output: similar per-phase settings with larger mem-fraction/max-running-requests/cuda-graph limits, disaggregation settings, optional MTP, and benchmark profiles.
8k1k configs `recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml`, `recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml`, `recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml`, `recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`	New deployment recipes for 8k input / 1k output: model/resource declarations, prefill/decode env and sglang_config, parallelism and attention flags, disabled radix-cache, disaggregation modes/ports/transfer-backend, MTP parameters, and benchmarks.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

fix sglang confusing path resolution behavior for models and adds output path for logs #98: Overlaps on sglang model-path/container handling and path-resolution adjustments referenced in these recipes.
Update MTP recipe with multiple draft steps #78: Related MTP/speculative parameter changes (speculative-num-steps, draft tokens) used in these recipes.

Suggested reviewers

kyleliang-nv
trevor-m

Poem

🐰 New recipes hop across the lane,
FP8 H100 seeds the train,
Prefill, decode in tidy rows,
MTP drafts nudge the throughput grows,
A rabbit twitches, configs reign.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'Add H100 FP8 SGLang Disaggregated Recipes for DSR1' directly and concisely summarizes the main change: adding H100 FP8 SGLang disaggregated recipes for DeepSeek-R1, matching the 12 new YAML configuration files across three sequence-size groups.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`:
- Around line 58-61: The recipe's max-prefill-tokens is too low for 8k inputs
(benchmark.isl: 8192); update the max-prefill-tokens value in the h100 8k1k
recipe files (e.g., the mtp/stp YAMLs that define mem-fraction-static,
max-prefill-tokens, chunked-prefill-size) to at least 8192 so 8192-token
requests are accepted (also consider aligning chunked-prefill-size to 8192 or a
suitable chunking strategy to avoid premature truncation).

In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml`:
- Around line 56-59: The max-prefill-tokens setting is too low for the 8k
benchmark (benchmark.isl = 8192); update the value of max-prefill-tokens in the
recipe to be at least 8192 (e.g., set max-prefill-tokens: 8192) so prefill does
not reject or truncate 8k requests; verify related settings like
chunked-prefill-size if needed to support the larger prefill window.

🧹 Nitpick comments (1)

recipes/h100/1k8k/stp/h100-fp8-1p2d-max-tp.yaml (1)
39-39: Minor: Trailing whitespace.

Line 39 has trailing whitespace after 1.
🧹 Proposed fix
-      ep-size: 1 
+      ep-size: 1

coderabbitai · 2026-02-03T08:39:43Z

+      # Memory and token limits
+      mem-fraction-static: 0.6
+      max-prefill-tokens: 2048
+      chunked-prefill-size: 2048


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "h100-fp8-1p1d-max-dep-mtp.yaml" -type f

Repository: ishandhanani/srt-slurm

Length of output: 230

🏁 Script executed:

cat -n recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml

Repository: ishandhanani/srt-slurm

Length of output: 3505

🏁 Script executed:

# Search for benchmark.isl settings in the same directory grep -r "benchmark.isl" recipes/h100/8k1k/mtp/ || echo "Not found in mtp directory"

Repository: ishandhanani/srt-slurm

Length of output: 92

🏁 Script executed:

# Search for benchmark.isl in the parent or related directories find recipes/h100/8k1k -name "*.yaml" -exec grep -l "benchmark.isl" {} \;

Repository: ishandhanani/srt-slurm

Length of output: 48

🏁 Script executed:

# Check the 1k1k variant sed -n '58,111p' recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml

Repository: ishandhanani/srt-slurm

Length of output: 1341

🏁 Script executed:

# Check the 1k8k variant sed -n '58,111p' recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml

Repository: ishandhanani/srt-slurm

Length of output: 1341

🏁 Script executed:

# Search for documentation or usage comments about max-prefill-tokens grep -r "max-prefill-tokens" . --include="*.yaml" --include="*.md" --include="*.py" -B2 -A2

Repository: ishandhanani/srt-slurm

Length of output: 11690

🏁 Script executed:

find recipes/h100/8k1k -name "*.yaml" -type f

Repository: ishandhanani/srt-slurm

Length of output: 267

🏁 Script executed:

# Check all h100 8k1k files for their benchmark isl values for file in $(find recipes/h100/8k1k -name "*.yaml" -type f); do echo "=== $file ===" grep -A2 "benchmark:" "$file" | head -5 grep "max-prefill-tokens:" "$file" done

Repository: ishandhanani/srt-slurm

Length of output: 591

Increase max-prefill-tokens to support 8k benchmark input.

The 8k1k recipe has benchmark.isl: 8192 but max-prefill-tokens: 2048, which will reject or truncate 8k requests. Update to at least 8192.

Proposed fix

max-prefill-tokens: 8192

This affects all h100 8k1k recipe variants: both mtp and stp modes (4 files total).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Memory and token limits

mem-fraction-static: 0.6

max-prefill-tokens: 2048

chunked-prefill-size: 2048

# Memory and token limits

mem-fraction-static: 0.6

max-prefill-tokens: 8192

chunked-prefill-size: 2048

🤖 Prompt for AI Agents

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` around lines 58 - 61, The recipe's max-prefill-tokens is too low for 8k inputs (benchmark.isl: 8192); update the max-prefill-tokens value in the h100 8k1k recipe files (e.g., the mtp/stp YAMLs that define mem-fraction-static, max-prefill-tokens, chunked-prefill-size) to at least 8192 so 8192-token requests are accepted (also consider aligning chunked-prefill-size to 8192 or a suitable chunking strategy to avoid premature truncation).

coderabbitai · 2026-02-03T08:39:43Z

+      # Memory and token limits
+      mem-fraction-static: 0.6
+      max-prefill-tokens: 2048
+      chunked-prefill-size: 2048


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd "h100-fp8-1p1d-max-dep.yaml" --type f

Repository: ishandhanani/srt-slurm

Length of output: 212

🏁 Script executed:

cat -n recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '50,110p'

Repository: ishandhanani/srt-slurm

Length of output: 1807

🏁 Script executed:

rg "max-prefill-tokens" recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml -n

Repository: ishandhanani/srt-slurm

Length of output: 99

🏁 Script executed:

rg "max-prefill-tokens" recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml -n

Repository: ishandhanani/srt-slurm

Length of output: 231

🏁 Script executed:

rg "^ benchmark:" -A 4 recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml

Repository: ishandhanani/srt-slurm

Length of output: 48

🏁 Script executed:

cat -n recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml | tail -20

Repository: ishandhanani/srt-slurm

Length of output: 672

🏁 Script executed:

cat -n recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml | tail -20

Repository: ishandhanani/srt-slurm

Length of output: 672

Raise max-prefill-tokens to cover the 8k benchmark input.

benchmark.isl is 8192, but max-prefill-tokens is 2048, so prefill will reject or truncate 8k requests in this 8k1k recipe. Set it to >= 8192.

✅ Proposed fix

- max-prefill-tokens: 2048 + max-prefill-tokens: 8192

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Memory and token limits

mem-fraction-static: 0.6

max-prefill-tokens: 2048

chunked-prefill-size: 2048

# Memory and token limits

mem-fraction-static: 0.6

max-prefill-tokens: 8192

chunked-prefill-size: 2048

🤖 Prompt for AI Agents

In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml` around lines 56 - 59, The max-prefill-tokens setting is too low for the 8k benchmark (benchmark.isl = 8192); update the value of max-prefill-tokens in the recipe to be at least 8192 (e.g., set max-prefill-tokens: 8192) so prefill does not reject or truncate 8k requests; verify related settings like chunked-prefill-size if needed to support the larger prefill window.

* update * Add H00 FP8 1k/8k configs * update * update * update * update * update * update * update h100 container and dg --------- Co-authored-by: Jinyan Chen <jinyanc@nvidia.com> Co-authored-by: ishandhanani <ishandhanani@gmail.com>

coderabbitai

Actionable comments posted: 6

🤖 Fix all issues with AI agents

In `@recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`:
- Around line 77-81: The DEP parallelism is misconfigured: keep tp-size: 16 and
enable-dp-attention: true but change dp-size from 16 to 2 and ep-size from 16 to
8 so the configuration matches a 2-node × 8-GPU per node layout (tp-size: 16,
dp-size: 2, ep-size: 8); update the YAML entries for tp-size, dp-size, and
ep-size accordingly.

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`:
- Around line 3-6: Update the model container tag to the recommended runtime
variant by replacing the current container value "lmsysorg/sglang:v0.5.8-cu130"
with "lmsysorg/sglang:v0.5.8-cu130-runtime" in the model definition (look for
the container: field in the YAML where the string "lmsysorg/sglang:v0.5.8-cu130"
appears).
- Around line 36-40: Prefill parallelism currently sets tp-size: 16, dp-size: 1,
ep-size: 1, enable-dp-attention: false which mismatches the decode phase and
violates SGLang PD disaggregation best practices; update the prefill section to
use the same DP/EP and DP-attention settings as decode (e.g., set dp-size: 16,
ep-size: 16 and enable-dp-attention: true) so phase-homogeneous DP-attention
groups are maintained (adjust tp-size only if decode differs).

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml`:
- Around line 56-59: The max-prefill-tokens value is too low for isl: 8192;
update the max-prefill-tokens setting (currently 2048) to at least 8192 so 8k
inputs won't be rejected or truncated, keep chunked-prefill-size at 2048, and
make the same change for the other max-prefill-tokens occurrence in this recipe;
reference the mem-fraction-static, max-prefill-tokens, and chunked-prefill-size
entries when locating the lines to edit.
- Around line 1-6: The recipe sets max-prefill-tokens to 2048 while the
benchmark input sequence length (isl) is 8192, causing a 4× mismatch that can
reject or stall prefill; update the max-prefill-tokens value (the YAML key
max-prefill-tokens) to at least 8192 or implement/confirm chunked prefill
behavior that safely handles isl=8192 (adjust associated prefill/chunking logic
if present in the code paths that read max-prefill-tokens and isl); verify the
change by ensuring the configuration (max-prefill-tokens) and the benchmark
parameter (isl) in this recipe are consistent.

In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml`:
- Around line 54-57: The max-prefill-tokens setting is too low for the 8k
benchmark (benchmark.isl = 8192); update the max-prefill-tokens value in the
YAML (parameter name: max-prefill-tokens) to at least 8192 (or higher if you
want headroom) so 8k prefill requests are not rejected or truncated, and keep
chunked-prefill-size aligned if needed.

coderabbitai · 2026-02-04T20:20:56Z

+      # Parallelism
+      tp-size: 16
+      dp-size: 16
+      ep-size: 16
+      enable-dp-attention: true


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

SGLang DEP Data Expert Parallelism dp-attention enable configuration

💡 Result:

To enable DEP (Data + Expert Parallelism) in SGLang for MoE models (e.g., DeepSeek V3/R1), you typically combine:

DP-Attention (data-parallel attention to reduce KV-cache duplication)

EP (expert parallelism for MoE experts)

1) Enable DP-Attention (dp-attention)

Add --enable-dp-attention and set --tp and --dp:

Single node (example: 8 GPUs):

python -m sglang.launch_server \ --model-path deepseek-ai/DeepSeek-V3 \ --tp 8 --dp 8 --enable-dp-attention

(Common high-throughput config; DP attention helps especially at large batch sizes.) [1]

Multi-node hybrid example (2 nodes × 8 GPUs = 16 GPUs total):

python -m sglang.launch_server \ --model-path deepseek-ai/DeepSeek-V3 \ --tp 16 --dp 2 --enable-dp-attention

This means 2 DP groups, each with 8 GPUs for attention TP. [1]

Notes:

DP-attention is not recommended for low-latency / small-batch serving. [1]

Some docs also state a constraint that “dp size should be equal to tp size” for --enable-dp-attention; treat this as a model/feature-specific limitation and prefer the per-model DeepSeek guidance above when applicable. [1][2]

2) Enable Expert Parallelism (EP) for MoE

Set --ep and (optionally) choose EP backends:

python -m sglang.launch_server \ --model-path deepseek-ai/DeepSeek-V3 \ --tp 8 --ep 8 \ --moe-a2a-backend deepep \ --moe-runner-backend deep_gemm

(Example from SGLang EP docs.) [3]

Important EP constraint:

For deepep / mooncake / some other A2A backends, SGLang currently supports ep_size = tp_size (within the expert-parallel group). [3]

3) Putting it together (DEP = DP-attention + EP) on 16 GPUs (2×8)

A common pattern is:

--tp 16 --dp 2 --enable-dp-attention (attention uses 2 DP groups × 8-way TP) [1]

EP runs within each DP group, so set --ep 8 and use EP backends that require ep=tp_per_group (8 here). [3]

Skeleton:

python -m sglang.launch_server \ --model-path deepseek-ai/DeepSeek-V3 \ --tp 16 --dp 2 --enable-dp-attention \ --ep 8 --moe-a2a-backend deepep --moe-runner-backend deep_gemm

4) Extra: DP-attention recommended defaults for DeepSeek V3.2

For DeepSeek V3.2, SGLang notes kernels are tuned around dp_size=8, recommending:

--dp 8 --enable-dp-attention

for stability/perf in their tests. [4]

Sources: [1] DeepSeek V3/V3.1/R1 Usage (SGLang docs), [2] Server Arguments (SGLang docs), [3] Expert Parallelism (SGLang docs), [4] DeepSeek V3.2 Usage (SGLang docs.

Correct DEP parallelism settings for 16-GPU allocation.

The decode configuration uses tp-size: 16, dp-size: 16, ep-size: 16 on 16 GPUs, but this deviates from SGLang's documented DEP pattern. For a 2-node × 8-GPU setup with enable-dp-attention, SGLang's recommended configuration is:

tp: 16, dp: 2 (2 DP groups, each with 8-way TP)

ep: 8 (expert parallelism within each DP group)

With 16 total GPUs, dp=16 is not feasible—it would require 16 model replicas across 16 GPUs. Change to dp-size: 2 and ep-size: 8 to align with SGLang's DEP semantics and DeepSeek's tuned defaults.

🤖 Prompt for AI Agents

In `@recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` around lines 77 - 81, The DEP parallelism is misconfigured: keep tp-size: 16 and enable-dp-attention: true but change dp-size from 16 to 2 and ep-size from 16 to 8 so the configuration matches a 2-node × 8-GPU per node layout (tp-size: 16, dp-size: 2, ep-size: 8); update the YAML entries for tp-size, dp-size, and ep-size accordingly.

coderabbitai · 2026-02-04T20:20:57Z

+model:
+  path: "dsfp8"
+  container: "lmsysorg/sglang:v0.5.8-cu130"
+  precision: "fp8"


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

lmsysorg/sglang v0.5.8 docker image tags runtime cu130

💡 Result:

For SGLang v0.5.8 with the CUDA 13.0 (“cu130”) runtime variant, the tag you want is:

lmsysorg/sglang:v0.5.8-cu130-runtime (pinned to the v0.5.8 release) [3]

Related / commonly used aliases:

lmsysorg/sglang:latest-cu130-runtime (moving “stable” CUDA 13.0 runtime tag recommended in the install docs) [1]

Tag naming pattern confirmation (example from the previous release):

lmsysorg/sglang:v0.5.7-cu130-runtime exists on Docker Hub mirrors, matching the vX.Y.Z-cu130-runtime convention [2]

Sources: SGLang install docs (CUDA 13.0 runtime recommendation) [1], Docker image metadata for v0.5.7-cu130-runtime [2], SGLang v0.5.8 release tag (confirms v0.5.8 exists) [3].

Update container tag to the recommended runtime variant.

Line 5 uses v0.5.8-cu130, but the recommended tag for this release is lmsysorg/sglang:v0.5.8-cu130-runtime. The runtime variant is smaller and is the standard recommendation in the SGLang install documentation.

Proposed fix

- container: "lmsysorg/sglang:v0.5.8-cu130" + container: "lmsysorg/sglang:v0.5.8-cu130-runtime"

🤖 Prompt for AI Agents

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` around lines 3 - 6, Update the model container tag to the recommended runtime variant by replacing the current container value "lmsysorg/sglang:v0.5.8-cu130" with "lmsysorg/sglang:v0.5.8-cu130-runtime" in the model definition (look for the container: field in the YAML where the string "lmsysorg/sglang:v0.5.8-cu130" appears).

coderabbitai · 2026-02-04T20:20:57Z

+      # Parallelism
+      tp-size: 16
+      dp-size: 1
+      ep-size: 1
+      enable-dp-attention: false


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "h100-fp8-1p1d-max-dep-mtp.yaml" 2>/dev/null

Repository: ishandhanani/srt-slurm

Length of output: 230

🏁 Script executed:

# Search for recipe files in h100 directory to understand structure fd "max-dep" --type f 2>/dev/null | head -20

Repository: ishandhanani/srt-slurm

Length of output: 371

🏁 Script executed:

# Look at the specific file mentioned in the review cat -n recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml 2>/dev/null || find . -path "*recipes*max-dep*" -type f 2>/dev/null

Repository: ishandhanani/srt-slurm

Length of output: 3357

🏁 Script executed:

# Check the 1k8k max-dep-mtp variant cat -n recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1495

🏁 Script executed:

# Check the 1k1k max-dep-mtp variant cat -n recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1489

🏁 Script executed:

# Check the 8k1k stp (non-MTP) max-dep variant cat -n recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1430

🏁 Script executed:

# Check the 1k8k stp variant cat -n recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1430

🏁 Script executed:

# Check the 1k1k stp variant cat -n recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1430

🏁 Script executed:

# Search for any documentation or comments about prefill/decode parallelism strategy rg -i "prefill.*parallelism|tp-only|disaggregat" --type yaml -A 2 -B 2 recipes/ | head -50

Repository: ishandhanani/srt-slurm

Length of output: 3667

🏁 Script executed:

# Look for README or documentation files that explain the recipe design find . -type f $ -name "README*" -o -name "*.md" $ -exec grep -l "max-dep\|prefill\|decode" {} \; 2>/dev/null | head -10

Repository: ishandhanani/srt-slurm

Length of output: 268

🏁 Script executed:

# Check CLAUDE.md for relevant information cat CLAUDE.md | grep -i -A 5 -B 5 "max-dep\|prefill.*parallelism\|decode.*parallelism" | head -100

Repository: ishandhanani/srt-slurm

Length of output: 48

🏁 Script executed:

# Check config reference for parallelism guidance cat docs/config-reference.md | grep -i -A 3 -B 3 "dp-size\|enable-dp-attention\|disaggregat" | head -80

Repository: ishandhanani/srt-slurm

Length of output: 3109

🏁 Script executed:

# Check architecture documentation cat docs/architecture.md | grep -i -A 5 -B 5 "prefill\|decode\|disaggregat" | head -100

Repository: ishandhanani/srt-slurm

Length of output: 3553

🏁 Script executed:

# Search for any comments or documentation about why prefill uses TP-only rg "TP.*only|prefill.*parallelism|decode.*parallelism" --type yaml --type md -B 2 -A 2 | head -80

Repository: ishandhanani/srt-slurm

Length of output: 476

🏁 Script executed:

# Check if there are any non-max-dep recipes with different prefill/decode parallelism patterns fd "\.yaml$" recipes/ -x grep -l "dp-size.*16" {} \; | head -20

Repository: ishandhanani/srt-slurm

Length of output: 399

🏁 Script executed:

# Look for any references to prefill-specific optimization or reasoning find . -type f -name "*.md" -o -name "*.py" | xargs grep -i "prefill.*parallelism\|why.*prefill\|tp.*only" 2>/dev/null | head -30

Repository: ishandhanani/srt-slurm

Length of output: 725

🏁 Script executed:

# Check if there's a blog post that explains the max-dep strategy cat recipes/h100/blog-post.yaml | head -100

Repository: ishandhanani/srt-slurm

Length of output: 2659

🏁 Script executed:

# Look for any issue tracker or comments discussing prefill parallelism strategy find . -type f $ -name "*.md" -o -name "*.txt" $ -exec grep -l "prefill.*parallelism\|prefill.*tp.*only\|decode.*expert" {} \; 2>/dev/null

Repository: ishandhanani/srt-slurm

Length of output: 48

🏁 Script executed:

# Check git history or comments for when max-dep recipes were created git log --oneline recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml 2>/dev/null | head -10

Repository: ishandhanani/srt-slurm

Length of output: 102

🏁 Script executed:

# Search for any discussion or rationale in commit messages or issues find . -type f -name "*.md" -o -name "*.txt" -o -name "NOTES*" | xargs grep -i "max-dep\|prefill.*tp.*only" 2>/dev/null

Repository: ishandhanani/srt-slurm

Length of output: 50

🏁 Script executed:

# Compare the blog-post config with max-dep recipes more carefully echo "=== BLOG POST PREFILL ===" && cat recipes/h100/blog-post.yaml | sed -n '48,65p' && echo -e "\n=== MAX-DEP PREFILL ===" && cat recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml | sed -n '36,40p'

Repository: ishandhanani/srt-slurm

Length of output: 713

🏁 Script executed:

# Check when max-dep recipes were created vs blog-post git log --oneline recipes/h100/blog-post.yaml 2>/dev/null | head -5

Repository: ishandhanani/srt-slurm

Length of output: 102

🏁 Script executed:

# Look for any SGLang documentation or issues about prefill parallelism strategy rg "prefill.*expert\|expert.*parallel.*prefill\|disable.*dp.*attention" --type md --type yaml -B 2 -A 2 | head -60

Repository: ishandhanani/srt-slurm

Length of output: 48

🏁 Script executed:

# Check if there's any README in the h100 recipes directory cat recipes/h100/README.md 2>/dev/null || find recipes/h100 -maxdepth 1 -name "*.md" -o -name "*.txt"

Repository: ishandhanani/srt-slurm

Length of output: 48

🌐 Web query:

SGLang disaggregated prefill decode parallelism strategy dp-size expert parallel

💡 Result:

In SGLang PD (Prefill/Decode) disaggregation, the core idea is to separate the compute-bound prefill stage from the memory-/KV-cache-bound decode stage to avoid (1) decode being interrupted by prefill and (2) “DP attention imbalance” where different DP workers end up doing different phases at the same time, increasing decode latency. [1]

How to think about the parallelism knobs

1) --dp-size (with --enable-dp-attention)

In PD mode, you typically want prefill and decode to each run as a “homogeneous” group, so that within a DP-attention group everyone is doing prefill (on prefill workers) or decode (on decode workers), instead of mixing phases (that’s the imbalance PD is meant to eliminate). [1]

In SGLang’s DP-attention setup (as documented for DeepSeek), the attention tensor-parallel group size is effectively:
atten_tp_size = tp_size / dp_size. [2]
This matters because it changes the “shape” of the attention parallel group and any features that reuse it (e.g., some context-parallel setups).

2) Prefill-side strategy (compute-bound)

Favor parallelism that improves throughput on large prompt batches / long prompts, e.g.:

higher TP and/or PP + chunked prefill (pipeline parallelism for long context).

SGLang’s PP work explicitly targets long-context TTFT and is designed to be compatible with PD disaggregation, and the roadmap notes configurations with different TP degrees on prefill vs decode. [3]

3) Decode-side strategy (KV/memory-bound)

Scale decode capacity primarily by adding more decode workers (horizontal scale) and by using MoE-friendly parallelism when applicable.

For MoE models, SGLang recommends using Expert Parallelism (EP) to distribute expert weights and reduce MoE memory bottlenecks. [4]

If you use DeepEP/Mooncake for MoE all-to-all (--moe-a2a-backend deepep|mooncake), note the current constraint: these backends only support ep_size = tp_size (hybrid ep_size < tp_size requires the none backend). [4]

A concrete, “known-good” PD + DP-attention + MoE pattern (DeepSeek example)

SGLang’s PD disaggregation docs show multi-node DeepSeek workers launched with the same DP-attention shape on both sides (example uses --tp-size 16 --dp-size 8 --enable-dp-attention) and MoE all-to-all via DeepEP (--moe-a2a-backend deepep). [1]

Rule of thumb:

Prefill workers: optimize for compute/TTFT (TP/PP/chunking), keep DP-attention groups phase-homogeneous.

Decode workers: optimize for KV-cache / MoE memory & latency; for MoE, prefer EP, and if using DeepEP/Mooncake keep ep == tp. [1][4]

Sources: [1] PD disaggregation docs, incl. DP-attention imbalance + DeepSeek PD launch example. [2] DeepSeek V3.2 docs (atten_tp_size = tp_size / dp_size). [3] SGLang PP roadmap issue (PP compatible with PD, differing TP degrees prefill vs decode). [4] SGLang expert parallelism docs (DeepEP/Mooncake constraint ep_size = tp_size, backend flags).

Prefill parallelism settings are misaligned with SGLang PD disaggregation best practices.

The current prefill config uses TP-only (dp-size: 1, ep-size: 1, enable-dp-attention: false) while decode uses full DEP (dp-size: 16, ep-size: 16, enable-dp-attention: true). SGLang's documented PD disaggregation pattern requires prefill and decode to maintain phase-homogeneous DP-attention groups to avoid attention imbalance. The blog-post config in this repository and SGLang's official guidance both use matching DP settings across both phases (e.g., dp-size: 16 on both).

Recommended fix

- dp-size: 1 - ep-size: 1 - enable-dp-attention: false + dp-size: 16 + ep-size: 16 + enable-dp-attention: true

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Parallelism

tp-size: 16

dp-size: 1

ep-size: 1

enable-dp-attention: false

# Parallelism

tp-size: 16

dp-size: 16

ep-size: 16

enable-dp-attention: true

🤖 Prompt for AI Agents

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` around lines 36 - 40, Prefill parallelism currently sets tp-size: 16, dp-size: 1, ep-size: 1, enable-dp-attention: false which mismatches the decode phase and violates SGLang PD disaggregation best practices; update the prefill section to use the same DP/EP and DP-attention settings as decode (e.g., set dp-size: 16, ep-size: 16 and enable-dp-attention: true) so phase-homogeneous DP-attention groups are maintained (adjust tp-size only if decode differs).

coderabbitai · 2026-02-04T20:20:57Z

+name: "h100-fp8-1p1d-max-tp-mtp"
+
+model:
+  path: "dsfp8"
+  container: "lmsysorg/sglang:v0.5.8-cu130"
+  precision: "fp8"


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find and explore other H100 recipes to see which container tag pattern is used find recipes -name "*.yaml" -path "*/h100/*" | head -20

Repository: ishandhanani/srt-slurm

Length of output: 699

🏁 Script executed:

# Check the full content of the file in question cat -n recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml

Repository: ishandhanani/srt-slurm

Length of output: 3385

🏁 Script executed:

# Search for container tag patterns across all H100 recipes rg "container:" recipes/h100/ -A 0 | head -20

Repository: ishandhanani/srt-slurm

Length of output: 1254

🌐 Web query:

SGLang sglang v0.5.8 cu130 runtime image differences lmsysorg/sglang container tags

💡 Result:

How to read lmsysorg/sglang tags (with focus on v0.5.8 + cu130 + runtime)

1) *-runtime vs non-runtime

...-runtime images are intended for production and are significantly smaller (~40% reduction) because they exclude build tools and development dependencies. (docs.sglang.io)

Non-runtime images (e.g., latest) are “full” images that include more dev/build tooling (better for debugging or building extra deps inside the container). (docs.sglang.io)

2) What cu130 means

cu130 means the image is built for CUDA 13 (cu130). SGLang docs explicitly recommend using lmsysorg/sglang:latest-cu130-runtime for CUDA 13 environments (and dev-cu13 for nightly/dev). (docs.sglang.io)

In practice, the CUDA-13 runtime stack in these images includes CUDA 13.0.x user-space libraries (and requires a compatible NVIDIA driver on the host, as usual for CUDA containers).

3) What’s inside a cu130-runtime image (example from the official lineage)
A publicly indexed copy of lmsysorg/sglang:v0.5.7-cu130-runtime shows the typical CUDA-13 runtime base/contents used for these tags, including:

CUDA_VERSION=13.0.1

Ubuntu 24.04 base

Python 3.12

cuDNN 9.13.x (CUDA 13 build) (docker.aityp.com)
This is the cu130-runtime “family” you should expect v0.5.8-cu130-runtime to follow as well (same intent and very similar base stack), even if the exact patch versions can vary across builds.

4) dev-cu13 / nightly CUDA-13 tags
SGLang’s CUDA-13 docker workflow builds and publishes these tags:

lmsysorg/sglang:dev-x86-cu13

lmsysorg/sglang:dev-arm64-cu13

a multi-arch manifest lmsysorg/sglang:dev-cu13

plus nightly stamp tags like nightly-dev-cu13-YYYYMMDD-<sha> (github.com)
These track the latest code rather than a fixed release like v0.5.8.

5) Version-pinned tags

Tags starting with v0.5.8-... are pinned to the v0.5.8 release line (for example, docs reference lmsysorg/sglang:v0.5.8-rocm700-mi30x as a v0.5.8-specific build for that platform). (docs.sglang.io)

If you tell me which tags you’re comparing against (e.g., v0.5.8-cu130 vs v0.5.8-cu130-runtime vs latest-cu130-runtime), I can summarize the differences in a small table.

Citations:

1: https://docs.sglang.io/get_started/install.html?utm_source=openai

2: https://docs.sglang.io/get_started/install.html?utm_source=openai

3: https://docs.sglang.io/get_started/install.html?utm_source=openai

4: https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime?utm_source=openai

5: https://github.com/sgl-project/sglang/actions/runs/21652742322/workflow

6: https://docs.sglang.io/basic_usage/deepseek_v32.html?utm_source=openai

Verify max-prefill-tokens aligns with benchmark input sequence length.

The max-prefill-tokens is set to 2048 (line 58), but the benchmark uses isl: 8192 (line 108). This 4× mismatch could cause prefill to reject or stall requests during benchmarking. Confirm whether this is intentional (e.g., chunked prefill handles overflow) or if max-prefill-tokens should be increased to at least 8192.

Regarding the container tag: all other H100 recipes in the repository consistently use lmsysorg/sglang:v0.5.8-cu130 (non-runtime), so this file is aligned with the existing codebase pattern.

🤖 Prompt for AI Agents

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml` around lines 1 - 6, The recipe sets max-prefill-tokens to 2048 while the benchmark input sequence length (isl) is 8192, causing a 4× mismatch that can reject or stall prefill; update the max-prefill-tokens value (the YAML key max-prefill-tokens) to at least 8192 or implement/confirm chunked prefill behavior that safely handles isl=8192 (adjust associated prefill/chunking logic if present in the code paths that read max-prefill-tokens and isl); verify the change by ensuring the configuration (max-prefill-tokens) and the benchmark parameter (isl) in this recipe are consistent.

coderabbitai · 2026-02-04T20:20:57Z

+      # Memory and token limits
+      mem-fraction-static: 0.6
+      max-prefill-tokens: 2048
+      chunked-prefill-size: 2048


⚠️ Potential issue | 🟠 Major

max-prefill-tokens is too low for the 8k ISL benchmark.

isl: 8192 exceeds max-prefill-tokens: 2048, so 8k inputs will likely be rejected or truncated. Bump the max to at least 8192 (chunk size can stay 2048).

✅ Suggested fix

- max-prefill-tokens: 2048 + max-prefill-tokens: 8192

Also applies to: 108-109

🤖 Prompt for AI Agents

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml` around lines 56 - 59, The max-prefill-tokens value is too low for isl: 8192; update the max-prefill-tokens setting (currently 2048) to at least 8192 so 8k inputs won't be rejected or truncated, keep chunked-prefill-size at 2048, and make the same change for the other max-prefill-tokens occurrence in this recipe; reference the mem-fraction-static, max-prefill-tokens, and chunked-prefill-size entries when locating the lines to edit.

coderabbitai · 2026-02-04T20:20:57Z

+      # Memory and token limits
+      mem-fraction-static: 0.6
+      max-prefill-tokens: 2048
+      chunked-prefill-size: 2048


⚠️ Potential issue | 🟠 Major

Raise max-prefill-tokens to cover the 8k benchmark input.

benchmark.isl is 8192 while max-prefill-tokens is 2048, so 8k requests will be rejected or truncated.

✅ Proposed fix

- max-prefill-tokens: 2048 + max-prefill-tokens: 8192

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Memory and token limits

mem-fraction-static: 0.6

max-prefill-tokens: 2048

chunked-prefill-size: 2048

# Memory and token limits

mem-fraction-static: 0.6

max-prefill-tokens: 8192

chunked-prefill-size: 2048

🤖 Prompt for AI Agents

In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml` around lines 54 - 57, The max-prefill-tokens setting is too low for the 8k benchmark (benchmark.isl = 8192); update the max-prefill-tokens value in the YAML (parameter name: max-prefill-tokens) to at least 8192 (or higher if you want headroom) so 8k prefill requests are not rejected or truncated, and keep chunked-prefill-size aligned if needed.

liz-badada added 3 commits January 29, 2026 17:38

update

6a30b5d

Merge branch 'ishandhanani:main' into h100

4b08bff

Add H00 FP8 1k/8k configs

46e4fa4

liz-badada changed the title ~~H100~~ Add H100 FP8 SGLang Recipes for DSR1 Feb 1, 2026

jychen21 and others added 4 commits February 2, 2026 02:17

update

c455741

Merge branch 'ishandhanani:main' into h100

008e77a

update

0dd58ce

update

da4538a

liz-badada changed the title ~~Add H100 FP8 SGLang Recipes for DSR1~~ Add H100 FP8 SGLang Disaggregated Recipes for DSR1 Feb 3, 2026

liz-badada marked this pull request as ready for review February 3, 2026 02:32

jychen21 and others added 2 commits February 3, 2026 08:35

update

ae8452c

Merge branch 'ishandhanani:main' into h100

42d186b

coderabbitai bot reviewed Feb 3, 2026

View reviewed changes

jychen21 and others added 4 commits February 3, 2026 15:51

update

857f453

update

e42b1c5

Merge branch 'ishandhanani:main' into h100

62632ac

update h100 container and dg

207dcd5

ishandhanani merged commit da1472d into ishandhanani:main Feb 4, 2026
1 check was pending

coderabbitai bot reviewed Feb 4, 2026

View reviewed changes

coderabbitai bot mentioned this pull request Feb 5, 2026

Add MTP (Multi-Token Prediction) recipe variants for H200 configurations #120

Merged

3 tasks

coderabbitai bot mentioned this pull request Mar 25, 2026

Merge sa-submission-q1-2026 into main #228

Merged

Conversation

liz-badada commented Jan 30, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add H100 FP8 SGLang Disaggregated Recipes for DeepSeek-R1

Recipes Added

1k1k

8k1k

1k8k

Key Settings

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 4, 2026

Choose a reason for hiding this comment

1) Enable DP-Attention (dp-attention)

2) Enable Expert Parallelism (EP) for MoE

3) Putting it together (DEP = DP-attention + EP) on 16 GPUs (2×8)

4) Extra: DP-attention recommended defaults for DeepSeek V3.2

Uh oh!

coderabbitai bot Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 4, 2026

Choose a reason for hiding this comment

How to think about the parallelism knobs

1) --dp-size (with --enable-dp-attention)

2) Prefill-side strategy (compute-bound)

3) Decode-side strategy (KV/memory-bound)

A concrete, “known-good” PD + DP-attention + MoE pattern (DeepSeek example)

Uh oh!

coderabbitai bot Feb 4, 2026

Choose a reason for hiding this comment

How to read lmsysorg/sglang tags (with focus on v0.5.8 + cu130 + runtime)

Uh oh!

coderabbitai bot Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

liz-badada commented Jan 30, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 30, 2026 •

edited

Loading

1) `--dp-size` (with `--enable-dp-attention`)

How to read `lmsysorg/sglang` tags (with focus on v0.5.8 + cu130 + runtime)