
Add H100 FP8 SGLang Disaggregated Recipes for DSR1#121

Merged
ishandhanani merged 13 commits into ishandhanani:main from liz-badada:h100
Feb 4, 2026

Conversation

Contributor

@liz-badada liz-badada commented Jan 30, 2026

Add H100 FP8 SGLang Disaggregated Recipes for DeepSeek-R1

Recipes Added

1k1k

| Config | Mode | P:D Ratio | Total Nodes | Total GPUs | Prefill | Decode |
|---|---|---|---|---|---|---|
| h100-fp8-1p1d-max-dep-mtp | MTP (EAGLE) | 1P:1D | 4 | 32 | TP-only (tp=16) | DEP (tp=dp=ep=16) |
| h100-fp8-1p2d-max-tp-mtp | MTP (EAGLE) | 1P:2D | 6 | 48 | TP-only (tp=16) | TP-only (tp=16) |
| h100-fp8-1p1d-max-dep | STP | 1P:1D | 4 | 32 | TP-only (tp=16) | DEP (tp=dp=ep=16) |
| h100-fp8-1p2d-max-tp | STP | 1P:2D | 6 | 48 | TP-only (tp=16) | TP-only (tp=16) |

8k1k

| Config | Mode | P:D Ratio | Total Nodes | Total GPUs | Prefill | Decode |
|---|---|---|---|---|---|---|
| h100-fp8-1p1d-max-dep-mtp | MTP (EAGLE) | 1P:1D | 4 | 32 | TP-only (tp=16) | DEP (tp=dp=ep=16) |
| h100-fp8-1p1d-max-tp-mtp | MTP (EAGLE) | 1P:1D | 4 | 32 | TP-only (tp=16) | TP-only (tp=16) |
| h100-fp8-1p1d-max-dep | STP | 1P:1D | 4 | 32 | TP-only (tp=16) | DEP (tp=dp=ep=16) |
| h100-fp8-1p1d-max-tp | STP | 1P:1D | 4 | 32 | TP-only (tp=16) | TP-only (tp=16) |

1k8k

| Config | Mode | P:D Ratio | Total Nodes | Total GPUs | Prefill | Decode |
|---|---|---|---|---|---|---|
| h100-fp8-1p1d-max-dep-mtp | MTP (EAGLE) | 1P:1D | 4 | 32 | TP-only (tp=16) | DEP (tp=dp=ep=16) |
| h100-fp8-1p2d-max-tp-mtp | MTP (EAGLE) | 1P:2D | 6 | 48 | TP-only (tp=16) | TP-only (tp=16) |
| h100-fp8-1p1d-max-dep | STP | 1P:1D | 4 | 32 | TP-only (tp=16) | DEP (tp=dp=ep=16) |
| h100-fp8-1p2d-max-tp | STP | 1P:2D | 6 | 48 | TP-only (tp=16) | TP-only (tp=16) |

Key Settings

  • Container: lmsysorg/sglang:v0.5.8-cu130-runtime
  • Model: DeepSeek-R1 FP8
  • Backend: FlashInfer attention
  • Transfer: NIXL disaggregation
  • MTP (EAGLE): speculative-num-draft-tokens: 3, speculative-num-steps: 2
  • STP: Single Token Prediction (no speculative decoding)
  • DEP: Data+Expert Parallelism, tp=16, dp=16, ep=16, enable-dp-attention=true
  • TP-only: Tensor Parallelism, tp=16, dp=1, ep=1, enable-dp-attention=false
  • 1P:1D: 1 prefill worker (2 nodes) + 1 decode worker (2 nodes)
  • 1P:2D: 1 prefill worker (2 nodes) + 2 decode workers (4 nodes)
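As a rough illustration, the DEP and MTP bullets above map onto `sglang_config` flags along these lines (a sketch using only the field names quoted in the review comments, not the exact contents of any recipe file):

```yaml
# DEP decode parallelism: tp = dp = ep = 16 with DP attention enabled
tp-size: 16
dp-size: 16
ep-size: 16
enable-dp-attention: true

# MTP (EAGLE) speculative decoding parameters
speculative-num-draft-tokens: 3
speculative-num-steps: 2
```

The STP variants simply omit the speculative-decoding keys, and the TP-only variants set dp-size and ep-size to 1 with enable-dp-attention: false.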

Summary by CodeRabbit

  • Chores

    • Added multiple new H100 FP8 deployment recipes covering various sequence-length, node/worker and GPU configurations with per-phase prefill/decode resource, memory and disaggregation controls, cache/workspace flags, and benchmark presets.
  • New Features

    • Introduced multi-token prediction (MTP) and speculative decoding (EAGLE) parameters for tuning throughput (speculative steps/top-k/draft tokens) and improved decode/prefill orchestration.

Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

Walkthrough

Adds 12 new H100 FP8 YAML deployment recipes covering 1k1k, 1k8k, and 8k1k input/output variants. Each file defines model metadata, per-phase resources and environment, separate prefill/decode sglang_config (parallelism, disaggregation, cache), optional MTP/EAGLE parameters, and benchmark settings.
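Based on this walkthrough, each recipe file appears to follow roughly the shape below. This is a hedged skeleton assembled from field names quoted elsewhere in the review (`model`, `sglang_config`, `benchmark.isl`, etc.); the real files may order, nest, or name sections differently, and the `osl` key in particular is a guess:

```yaml
# Hypothetical recipe skeleton inferred from the review discussion;
# not copied verbatim from the repository.
name: "h100-fp8-1p1d-max-dep"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

prefill:
  sglang_config:
    tp-size: 16
    dp-size: 1
    ep-size: 1
    enable-dp-attention: false
    mem-fraction-static: 0.6
    max-prefill-tokens: 8192
    chunked-prefill-size: 2048

decode:
  sglang_config:
    tp-size: 16
    dp-size: 16
    ep-size: 16
    enable-dp-attention: true

benchmark:
  isl: 8192   # input sequence length
  osl: 1024   # output sequence length (key name assumed)
```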

Changes

| Cohort / File(s) | Summary |
|---|---|
| **1k1k configs**<br>`recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml`, `recipes/h100/1k1k/stp/h100-fp8-1p2d-max-tp.yaml`, `recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`, `recipes/h100/1k1k/mtp/h100-fp8-1p2d-max-tp-mtp.yaml` | New deployment recipes for 1k/1k workloads: model/container/precision, per-phase env, prefill/decode sglang_config (tp/dp/ep, attention backend, radix-cache, disaggregation, mem/token limits), MTP where present, and benchmarks. |
| **1k8k configs**<br>`recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml`, `recipes/h100/1k8k/stp/h100-fp8-1p2d-max-tp.yaml`, `recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`, `recipes/h100/1k8k/mtp/h100-fp8-1p2d-max-tp-mtp.yaml` | New deployment recipes tuned for 1k input / 8k output: similar per-phase settings with larger mem-fraction/max-running-requests/cuda-graph limits, disaggregation settings, optional MTP, and benchmark profiles. |
| **8k1k configs**<br>`recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml`, `recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml`, `recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml`, `recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` | New deployment recipes for 8k input / 1k output: model/resource declarations, prefill/decode env and sglang_config, parallelism and attention flags, disabled radix-cache, disaggregation modes/ports/transfer-backend, MTP parameters, and benchmarks. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • kyleliang-nv
  • trevor-m

Poem

🐰 New recipes hop across the lane,
FP8 H100 seeds the train,
Prefill, decode in tidy rows,
MTP drafts nudge the throughput grows,
A rabbit twitches, configs reign.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Add H100 FP8 SGLang Disaggregated Recipes for DSR1' directly and concisely summarizes the main change: adding H100 FP8 SGLang disaggregated recipes for DeepSeek-R1, matching the 12 new YAML configuration files across three sequence-size groups. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@liz-badada liz-badada changed the title from "H100" to "Add H100 FP8 SGLang Recipes for DSR1" Feb 1, 2026
@liz-badada liz-badada changed the title from "Add H100 FP8 SGLang Recipes for DSR1" to "Add H100 FP8 SGLang Disaggregated Recipes for DSR1" Feb 3, 2026
@liz-badada liz-badada marked this pull request as ready for review February 3, 2026 02:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`:
- Around line 58-61: The recipe's max-prefill-tokens is too low for 8k inputs
(benchmark.isl: 8192); update the max-prefill-tokens value in the h100 8k1k
recipe files (e.g., the mtp/stp YAMLs that define mem-fraction-static,
max-prefill-tokens, chunked-prefill-size) to at least 8192 so 8192-token
requests are accepted (also consider aligning chunked-prefill-size to 8192 or a
suitable chunking strategy to avoid premature truncation).

In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml`:
- Around line 56-59: The max-prefill-tokens setting is too low for the 8k
benchmark (benchmark.isl = 8192); update the value of max-prefill-tokens in the
recipe to be at least 8192 (e.g., set max-prefill-tokens: 8192) so prefill does
not reject or truncate 8k requests; verify related settings like
chunked-prefill-size if needed to support the larger prefill window.
🧹 Nitpick comments (1)
recipes/h100/1k8k/stp/h100-fp8-1p2d-max-tp.yaml (1)

39-39: Minor: Trailing whitespace.

Line 39 has trailing whitespace after 1.

🧹 Proposed fix
```diff
-      ep-size: 1 
+      ep-size: 1
```

Comment on lines +58 to +61
```yaml
      # Memory and token limits
      mem-fraction-static: 0.6
      max-prefill-tokens: 2048
      chunked-prefill-size: 2048
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "h100-fp8-1p1d-max-dep-mtp.yaml" -type f

Repository: ishandhanani/srt-slurm

Length of output: 230


🏁 Script executed:

cat -n recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml

Repository: ishandhanani/srt-slurm

Length of output: 3505


🏁 Script executed:

# Search for benchmark.isl settings in the same directory
grep -r "benchmark.isl" recipes/h100/8k1k/mtp/ || echo "Not found in mtp directory"

Repository: ishandhanani/srt-slurm

Length of output: 92


🏁 Script executed:

# Search for benchmark.isl in the parent or related directories
find recipes/h100/8k1k -name "*.yaml" -exec grep -l "benchmark.isl" {} \;

Repository: ishandhanani/srt-slurm

Length of output: 48


🏁 Script executed:

# Check the 1k1k variant
sed -n '58,111p' recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml

Repository: ishandhanani/srt-slurm

Length of output: 1341


🏁 Script executed:

# Check the 1k8k variant
sed -n '58,111p' recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml

Repository: ishandhanani/srt-slurm

Length of output: 1341


🏁 Script executed:

# Search for documentation or usage comments about max-prefill-tokens
grep -r "max-prefill-tokens" . --include="*.yaml" --include="*.md" --include="*.py" -B2 -A2

Repository: ishandhanani/srt-slurm

Length of output: 11690


🏁 Script executed:

find recipes/h100/8k1k -name "*.yaml" -type f

Repository: ishandhanani/srt-slurm

Length of output: 267


🏁 Script executed:

# Check all h100 8k1k files for their benchmark isl values
for file in $(find recipes/h100/8k1k -name "*.yaml" -type f); do
  echo "=== $file ==="
  grep -A2 "benchmark:" "$file" | head -5
  grep "max-prefill-tokens:" "$file"
done

Repository: ishandhanani/srt-slurm

Length of output: 591


Increase max-prefill-tokens to support 8k benchmark input.

The 8k1k recipe has benchmark.isl: 8192 but max-prefill-tokens: 2048, which will reject or truncate 8k requests. Update to at least 8192.

Proposed fix
```yaml
      max-prefill-tokens: 8192
```

This affects all h100 8k1k recipe variants: both mtp and stp modes (4 files total).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
       # Memory and token limits
       mem-fraction-static: 0.6
-      max-prefill-tokens: 2048
+      max-prefill-tokens: 8192
       chunked-prefill-size: 2048
```

Comment on lines +56 to +59
```yaml
      # Memory and token limits
      mem-fraction-static: 0.6
      max-prefill-tokens: 2048
      chunked-prefill-size: 2048
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd "h100-fp8-1p1d-max-dep.yaml" --type f

Repository: ishandhanani/srt-slurm

Length of output: 212


🏁 Script executed:

cat -n recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '50,110p'

Repository: ishandhanani/srt-slurm

Length of output: 1807


🏁 Script executed:

rg "max-prefill-tokens" recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml -n

Repository: ishandhanani/srt-slurm

Length of output: 99


🏁 Script executed:

rg "max-prefill-tokens" recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml -n

Repository: ishandhanani/srt-slurm

Length of output: 231


🏁 Script executed:

rg "^  benchmark:" -A 4 recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml

Repository: ishandhanani/srt-slurm

Length of output: 48


🏁 Script executed:

cat -n recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml | tail -20

Repository: ishandhanani/srt-slurm

Length of output: 672


🏁 Script executed:

cat -n recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml | tail -20

Repository: ishandhanani/srt-slurm

Length of output: 672


Raise max-prefill-tokens to cover the 8k benchmark input.

benchmark.isl is 8192, but max-prefill-tokens is 2048, so prefill will reject or truncate 8k requests in this 8k1k recipe. Set it to >= 8192.

✅ Proposed fix
```diff
-      max-prefill-tokens: 2048
+      max-prefill-tokens: 8192
```
📝 Committable suggestion


Suggested change

```diff
       # Memory and token limits
       mem-fraction-static: 0.6
-      max-prefill-tokens: 2048
+      max-prefill-tokens: 8192
       chunked-prefill-size: 2048
```

@ishandhanani ishandhanani merged commit da1472d into ishandhanani:main Feb 4, 2026
1 check was pending
ishandhanani added a commit that referenced this pull request Feb 4, 2026
* update

* Add H00 FP8 1k/8k configs

* update

* update

* update

* update

* update

* update

* update h100 container and dg

---------

Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`:
- Around line 77-81: The DEP parallelism is misconfigured: keep tp-size: 16 and
enable-dp-attention: true but change dp-size from 16 to 2 and ep-size from 16 to
8 so the configuration matches a 2-node × 8-GPU per node layout (tp-size: 16,
dp-size: 2, ep-size: 8); update the YAML entries for tp-size, dp-size, and
ep-size accordingly.

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`:
- Around line 3-6: Update the model container tag to the recommended runtime
variant by replacing the current container value "lmsysorg/sglang:v0.5.8-cu130"
with "lmsysorg/sglang:v0.5.8-cu130-runtime" in the model definition (look for
the container: field in the YAML where the string "lmsysorg/sglang:v0.5.8-cu130"
appears).
- Around line 36-40: Prefill parallelism currently sets tp-size: 16, dp-size: 1,
ep-size: 1, enable-dp-attention: false which mismatches the decode phase and
violates SGLang PD disaggregation best practices; update the prefill section to
use the same DP/EP and DP-attention settings as decode (e.g., set dp-size: 16,
ep-size: 16 and enable-dp-attention: true) so phase-homogeneous DP-attention
groups are maintained (adjust tp-size only if decode differs).

In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml`:
- Around line 56-59: The max-prefill-tokens value is too low for isl: 8192;
update the max-prefill-tokens setting (currently 2048) to at least 8192 so 8k
inputs won't be rejected or truncated, keep chunked-prefill-size at 2048, and
make the same change for the other max-prefill-tokens occurrence in this recipe;
reference the mem-fraction-static, max-prefill-tokens, and chunked-prefill-size
entries when locating the lines to edit.
- Around line 1-6: The recipe sets max-prefill-tokens to 2048 while the
benchmark input sequence length (isl) is 8192, causing a 4× mismatch that can
reject or stall prefill; update the max-prefill-tokens value (the YAML key
max-prefill-tokens) to at least 8192 or implement/confirm chunked prefill
behavior that safely handles isl=8192 (adjust associated prefill/chunking logic
if present in the code paths that read max-prefill-tokens and isl); verify the
change by ensuring the configuration (max-prefill-tokens) and the benchmark
parameter (isl) in this recipe are consistent.

In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml`:
- Around line 54-57: The max-prefill-tokens setting is too low for the 8k
benchmark (benchmark.isl = 8192); update the max-prefill-tokens value in the
YAML (parameter name: max-prefill-tokens) to at least 8192 (or higher if you
want headroom) so 8k prefill requests are not rejected or truncated, and keep
chunked-prefill-size aligned if needed.

Comment on lines +77 to +81
```yaml
      # Parallelism
      tp-size: 16
      dp-size: 16
      ep-size: 16
      enable-dp-attention: true
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

SGLang DEP Data Expert Parallelism dp-attention enable configuration

💡 Result:

To enable DEP (Data + Expert Parallelism) in SGLang for MoE models (e.g., DeepSeek V3/R1), you typically combine:

  • DP-Attention (data-parallel attention to reduce KV-cache duplication)
  • EP (expert parallelism for MoE experts)

1) Enable DP-Attention (dp-attention)

Add --enable-dp-attention and set --tp and --dp:

  • Single node (example: 8 GPUs):

    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V3 \
      --tp 8 --dp 8 --enable-dp-attention

    (Common high-throughput config; DP attention helps especially at large batch sizes.) [1]

  • Multi-node hybrid example (2 nodes × 8 GPUs = 16 GPUs total):

    python -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V3 \
      --tp 16 --dp 2 --enable-dp-attention

    This means 2 DP groups, each with 8 GPUs for attention TP. [1]

Notes:

  • DP-attention is not recommended for low-latency / small-batch serving. [1]
  • Some docs also state a constraint that “dp size should be equal to tp size” for --enable-dp-attention; treat this as a model/feature-specific limitation and prefer the per-model DeepSeek guidance above when applicable. [1][2]

2) Enable Expert Parallelism (EP) for MoE

Set --ep and (optionally) choose EP backends:

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --ep 8 \
  --moe-a2a-backend deepep \
  --moe-runner-backend deep_gemm

(Example from SGLang EP docs.) [3]

Important EP constraint:

  • For deepep / mooncake / some other A2A backends, SGLang currently supports ep_size = tp_size (within the expert-parallel group). [3]

3) Putting it together (DEP = DP-attention + EP) on 16 GPUs (2×8)

A common pattern is:

  • --tp 16 --dp 2 --enable-dp-attention (attention uses 2 DP groups × 8-way TP) [1]
  • EP runs within each DP group, so set --ep 8 and use EP backends that require ep=tp_per_group (8 here). [3]

Skeleton:

python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 --dp 2 --enable-dp-attention \
  --ep 8 --moe-a2a-backend deepep --moe-runner-backend deep_gemm

4) Extra: DP-attention recommended defaults for DeepSeek V3.2

For DeepSeek V3.2, SGLang notes kernels are tuned around dp_size=8, recommending:

--dp 8 --enable-dp-attention

for stability/perf in their tests. [4]

Sources: [1] DeepSeek V3/V3.1/R1 Usage (SGLang docs), [2] Server Arguments (SGLang docs), [3] Expert Parallelism (SGLang docs), [4] DeepSeek V3.2 Usage (SGLang docs.


Correct DEP parallelism settings for 16-GPU allocation.

The decode configuration uses tp-size: 16, dp-size: 16, ep-size: 16 on 16 GPUs, but this deviates from SGLang's documented DEP pattern. For a 2-node × 8-GPU setup with enable-dp-attention, SGLang's recommended configuration is:

  • tp: 16, dp: 2 (2 DP groups, each with 8-way TP)
  • ep: 8 (expert parallelism within each DP group)

With 16 total GPUs, dp=16 is not feasible—it would require 16 model replicas across 16 GPUs. Change to dp-size: 2 and ep-size: 8 to align with SGLang's DEP semantics and DeepSeek's tuned defaults.
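Applied to the decode block, the reviewer's suggested layout would look roughly like this (a sketch using the indentation seen in the other suggestions, not a tested configuration):

```yaml
      # Reviewer-suggested DEP layout for 2 nodes × 8 GPUs:
      # 2 DP groups of 8-way attention TP, with EP inside each group.
      tp-size: 16
      dp-size: 2
      ep-size: 8
      enable-dp-attention: true
```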


Comment on lines +3 to +6
```yaml
model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130"
  precision: "fp8"
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

lmsysorg/sglang v0.5.8 docker image tags runtime cu130

💡 Result:

For SGLang v0.5.8 with the CUDA 13.0 (“cu130”) runtime variant, the tag you want is:

  • lmsysorg/sglang:v0.5.8-cu130-runtime (pinned to the v0.5.8 release) [3]

Related / commonly used aliases:

  • lmsysorg/sglang:latest-cu130-runtime (moving “stable” CUDA 13.0 runtime tag recommended in the install docs) [1]

Tag naming pattern confirmation (example from the previous release):

  • lmsysorg/sglang:v0.5.7-cu130-runtime exists on Docker Hub mirrors, matching the vX.Y.Z-cu130-runtime convention [2]

Sources: SGLang install docs (CUDA 13.0 runtime recommendation) [1], Docker image metadata for v0.5.7-cu130-runtime [2], SGLang v0.5.8 release tag (confirms v0.5.8 exists) [3].


Update container tag to the recommended runtime variant.

Line 5 uses v0.5.8-cu130, but the recommended tag for this release is lmsysorg/sglang:v0.5.8-cu130-runtime. The runtime variant is smaller and is the standard recommendation in the SGLang install documentation.

Proposed fix
```diff
-  container: "lmsysorg/sglang:v0.5.8-cu130"
+  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
```

Comment on lines +36 to +40
```yaml
      # Parallelism
      tp-size: 16
      dp-size: 1
      ep-size: 1
      enable-dp-attention: false
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "h100-fp8-1p1d-max-dep-mtp.yaml" 2>/dev/null

Repository: ishandhanani/srt-slurm

Length of output: 230


🏁 Script executed:

# Search for recipe files in h100 directory to understand structure
fd "max-dep" --type f 2>/dev/null | head -20

Repository: ishandhanani/srt-slurm

Length of output: 371


🏁 Script executed:

# Look at the specific file mentioned in the review
cat -n recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml 2>/dev/null || find . -path "*recipes*max-dep*" -type f 2>/dev/null

Repository: ishandhanani/srt-slurm

Length of output: 3357


🏁 Script executed:

# Check the 1k8k max-dep-mtp variant
cat -n recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1495


🏁 Script executed:

# Check the 1k1k max-dep-mtp variant
cat -n recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1489


🏁 Script executed:

# Check the 8k1k stp (non-MTP) max-dep variant
cat -n recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1430


🏁 Script executed:

# Check the 1k8k stp variant
cat -n recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1430


🏁 Script executed:

# Check the 1k1k stp variant
cat -n recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '36,81p'

Repository: ishandhanani/srt-slurm

Length of output: 1430


🏁 Script executed:

# Search for any documentation or comments about prefill/decode parallelism strategy
rg -i "prefill.*parallelism|tp-only|disaggregat" --type yaml -A 2 -B 2 recipes/ | head -50

Repository: ishandhanani/srt-slurm

Length of output: 3667


🏁 Script executed:

# Look for README or documentation files that explain the recipe design
find . -type f \( -name "README*" -o -name "*.md" \) -exec grep -l "max-dep\|prefill\|decode" {} \; 2>/dev/null | head -10

Repository: ishandhanani/srt-slurm

Length of output: 268


🏁 Script executed:

# Check CLAUDE.md for relevant information
cat CLAUDE.md | grep -i -A 5 -B 5 "max-dep\|prefill.*parallelism\|decode.*parallelism" | head -100

Repository: ishandhanani/srt-slurm

Length of output: 48


🏁 Script executed:

# Check config reference for parallelism guidance
cat docs/config-reference.md | grep -i -A 3 -B 3 "dp-size\|enable-dp-attention\|disaggregat" | head -80

Repository: ishandhanani/srt-slurm

Length of output: 3109


🏁 Script executed:

# Check architecture documentation
cat docs/architecture.md | grep -i -A 5 -B 5 "prefill\|decode\|disaggregat" | head -100

Repository: ishandhanani/srt-slurm

Length of output: 3553


🏁 Script executed:

# Search for any comments or documentation about why prefill uses TP-only
rg "TP.*only|prefill.*parallelism|decode.*parallelism" --type yaml --type md -B 2 -A 2 | head -80

Repository: ishandhanani/srt-slurm

Length of output: 476


🏁 Script executed:

# Check if there are any non-max-dep recipes with different prefill/decode parallelism patterns
fd "\.yaml$" recipes/ -x grep -l "dp-size.*16" {} \; | head -20

Repository: ishandhanani/srt-slurm

Length of output: 399


🏁 Script executed:

# Look for any references to prefill-specific optimization or reasoning
find . -type f -name "*.md" -o -name "*.py" | xargs grep -i "prefill.*parallelism\|why.*prefill\|tp.*only" 2>/dev/null | head -30

Repository: ishandhanani/srt-slurm

Length of output: 725


🏁 Script executed:

# Check if there's a blog post that explains the max-dep strategy
cat recipes/h100/blog-post.yaml | head -100

Repository: ishandhanani/srt-slurm

Length of output: 2659


🏁 Script executed:

# Look for any issue tracker or comments discussing prefill parallelism strategy
find . -type f \( -name "*.md" -o -name "*.txt" \) -exec grep -l "prefill.*parallelism\|prefill.*tp.*only\|decode.*expert" {} \; 2>/dev/null

Repository: ishandhanani/srt-slurm

Length of output: 48


🏁 Script executed:

# Check git history or comments for when max-dep recipes were created
git log --oneline recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml 2>/dev/null | head -10

Repository: ishandhanani/srt-slurm

Length of output: 102


🏁 Script executed:

# Search for any discussion or rationale in commit messages or issues
find . -type f -name "*.md" -o -name "*.txt" -o -name "NOTES*" | xargs grep -i "max-dep\|prefill.*tp.*only" 2>/dev/null

Repository: ishandhanani/srt-slurm

Length of output: 50


🏁 Script executed:

# Compare the blog-post config with max-dep recipes more carefully
echo "=== BLOG POST PREFILL ===" && cat recipes/h100/blog-post.yaml | sed -n '48,65p' && echo -e "\n=== MAX-DEP PREFILL ===" && cat recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml | sed -n '36,40p'

Repository: ishandhanani/srt-slurm

Length of output: 713


🏁 Script executed:

# Check when max-dep recipes were created vs blog-post
git log --oneline recipes/h100/blog-post.yaml 2>/dev/null | head -5

Repository: ishandhanani/srt-slurm

Length of output: 102


🏁 Script executed:

# Look for any SGLang documentation or issues about prefill parallelism strategy
rg "prefill.*expert\|expert.*parallel.*prefill\|disable.*dp.*attention" --type md --type yaml -B 2 -A 2 | head -60

Repository: ishandhanani/srt-slurm

Length of output: 48


🏁 Script executed:

# Check if there's any README in the h100 recipes directory
cat recipes/h100/README.md 2>/dev/null || find recipes/h100 -maxdepth 1 -name "*.md" -o -name "*.txt"

Repository: ishandhanani/srt-slurm

Length of output: 48


🌐 Web query:

SGLang disaggregated prefill decode parallelism strategy dp-size expert parallel

💡 Result:

In SGLang PD (Prefill/Decode) disaggregation, the core idea is to separate the compute-bound prefill stage from the memory-/KV-cache-bound decode stage to avoid (1) decode being interrupted by prefill and (2) “DP attention imbalance” where different DP workers end up doing different phases at the same time, increasing decode latency. [1]

How to think about the parallelism knobs

1) --dp-size (with --enable-dp-attention)

  • In PD mode, you typically want prefill and decode to each run as a “homogeneous” group, so that within a DP-attention group everyone is doing prefill (on prefill workers) or decode (on decode workers), instead of mixing phases (that’s the imbalance PD is meant to eliminate). [1]
  • In SGLang’s DP-attention setup (as documented for DeepSeek), the attention tensor-parallel group size is effectively:
    atten_tp_size = tp_size / dp_size. [2]
    This matters because it changes the “shape” of the attention parallel group and any features that reuse it (e.g., some context-parallel setups).

2) Prefill-side strategy (compute-bound)

  • Favor parallelism that improves throughput on large prompt batches / long prompts, e.g.:
    • higher TP and/or PP + chunked prefill (pipeline parallelism for long context).
  • SGLang’s PP work explicitly targets long-context TTFT and is designed to be compatible with PD disaggregation, and the roadmap notes configurations with different TP degrees on prefill vs decode. [3]

3) Decode-side strategy (KV/memory-bound)

  • Scale decode capacity primarily by adding more decode workers (horizontal scale) and by using MoE-friendly parallelism when applicable.
  • For MoE models, SGLang recommends using Expert Parallelism (EP) to distribute expert weights and reduce MoE memory bottlenecks. [4]
  • If you use DeepEP/Mooncake for MoE all-to-all (--moe-a2a-backend deepep|mooncake), note the current constraint: these backends only support ep_size = tp_size (hybrid ep_size < tp_size requires the none backend). [4]

A concrete, “known-good” PD + DP-attention + MoE pattern (DeepSeek example)

SGLang’s PD disaggregation docs show multi-node DeepSeek workers launched with the same DP-attention shape on both sides (example uses --tp-size 16 --dp-size 8 --enable-dp-attention) and MoE all-to-all via DeepEP (--moe-a2a-backend deepep). [1]

Rule of thumb:

  • Prefill workers: optimize for compute/TTFT (TP/PP/chunking), keep DP-attention groups phase-homogeneous.
  • Decode workers: optimize for KV-cache / MoE memory & latency; for MoE, prefer EP, and if using DeepEP/Mooncake keep ep == tp. [1][4]

Sources: [1] PD disaggregation docs, incl. DP-attention imbalance + DeepSeek PD launch example. [2] DeepSeek V3.2 docs (atten_tp_size = tp_size / dp_size). [3] SGLang PP roadmap issue (PP compatible with PD, differing TP degrees prefill vs decode). [4] SGLang expert parallelism docs (DeepEP/Mooncake constraint ep_size = tp_size, backend flags).


Prefill parallelism settings are misaligned with SGLang PD disaggregation best practices.

The current prefill config uses TP-only (dp-size: 1, ep-size: 1, enable-dp-attention: false) while decode uses full DEP (dp-size: 16, ep-size: 16, enable-dp-attention: true). SGLang's documented PD disaggregation pattern requires prefill and decode to maintain phase-homogeneous DP-attention groups to avoid attention imbalance. The blog-post config in this repository and SGLang's official guidance both use matching DP settings across both phases (e.g., dp-size: 16 on both).

Recommended fix
-      dp-size: 1
-      ep-size: 1
-      enable-dp-attention: false
+      dp-size: 16
+      ep-size: 16
+      enable-dp-attention: true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
      # Parallelism
      tp-size: 16
      dp-size: 1
      ep-size: 1
      enable-dp-attention: false

After:
      # Parallelism
      tp-size: 16
      dp-size: 16
      ep-size: 16
      enable-dp-attention: true
🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` around lines 36 - 40,
Prefill parallelism currently sets tp-size: 16, dp-size: 1, ep-size: 1,
enable-dp-attention: false which mismatches the decode phase and violates SGLang
PD disaggregation best practices; update the prefill section to use the same
DP/EP and DP-attention settings as decode (e.g., set dp-size: 16, ep-size: 16
and enable-dp-attention: true) so phase-homogeneous DP-attention groups are
maintained (adjust tp-size only if decode differs).

Comment on lines +1 to +6
name: "h100-fp8-1p1d-max-tp-mtp"

model:
path: "dsfp8"
container: "lmsysorg/sglang:v0.5.8-cu130"
precision: "fp8"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find and explore other H100 recipes to see which container tag pattern is used
find recipes -name "*.yaml" -path "*/h100/*" | head -20

Repository: ishandhanani/srt-slurm



🏁 Script executed:

# Check the full content of the file in question
cat -n recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml



🏁 Script executed:

# Search for container tag patterns across all H100 recipes
rg "container:" recipes/h100/ -A 0 | head -20



🌐 Web query:

SGLang sglang v0.5.8 cu130 runtime image differences lmsysorg/sglang container tags

💡 Result:

How to read lmsysorg/sglang tags (with focus on v0.5.8 + cu130 + runtime)

1) *-runtime vs non-runtime

  • ...-runtime images are intended for production and are significantly smaller (~40% reduction) because they exclude build tools and development dependencies. (docs.sglang.io)
  • Non-runtime images (e.g., latest) are “full” images that include more dev/build tooling (better for debugging or building extra deps inside the container). (docs.sglang.io)

2) What cu130 means

  • cu130 means the image is built for CUDA 13 (cu130). SGLang docs explicitly recommend using lmsysorg/sglang:latest-cu130-runtime for CUDA 13 environments (and dev-cu13 for nightly/dev). (docs.sglang.io)
  • In practice, the CUDA-13 runtime stack in these images includes CUDA 13.0.x user-space libraries (and requires a compatible NVIDIA driver on the host, as usual for CUDA containers).

3) What’s inside a cu130-runtime image (example from the official lineage)
A publicly indexed copy of lmsysorg/sglang:v0.5.7-cu130-runtime shows the typical CUDA-13 runtime base/contents used for these tags, including:

  • CUDA_VERSION=13.0.1
  • Ubuntu 24.04 base
  • Python 3.12
  • cuDNN 9.13.x (CUDA 13 build) (docker.aityp.com)
    This is the cu130-runtime “family” you should expect v0.5.8-cu130-runtime to follow as well (same intent and very similar base stack), even if the exact patch versions can vary across builds.

4) dev-cu13 / nightly CUDA-13 tags
SGLang’s CUDA-13 docker workflow builds and publishes these tags:

  • lmsysorg/sglang:dev-x86-cu13
  • lmsysorg/sglang:dev-arm64-cu13
  • a multi-arch manifest lmsysorg/sglang:dev-cu13
  • plus nightly stamp tags like nightly-dev-cu13-YYYYMMDD-<sha> (github.com)
    These track the latest code rather than a fixed release like v0.5.8.

5) Version-pinned tags

  • Tags starting with v0.5.8-... are pinned to the v0.5.8 release line (for example, docs reference lmsysorg/sglang:v0.5.8-rocm700-mi30x as a v0.5.8-specific build for that platform). (docs.sglang.io)

If you tell me which tags you’re comparing against (e.g., v0.5.8-cu130 vs v0.5.8-cu130-runtime vs latest-cu130-runtime), I can summarize the differences in a small table.
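The tag taxonomy above can be summarized with a tiny parser. Illustrative only: the grammar is inferred from the example tags cited (version-pinned vs latest/dev, cuNN suffix, -runtime suffix); nightly stamp tags and platform-specific builds fall through unparsed:

```python
import re

def parse_sglang_tag(tag: str) -> dict:
    """Split an lmsysorg/sglang tag into release, CUDA variant, and
    runtime flag, based on the patterns discussed above (e.g.
    v0.5.8-cu130-runtime, latest-cu130-runtime, dev-cu13)."""
    m = re.fullmatch(
        r"(?P<release>v[\d.]+|latest|dev)(?:-(?P<cuda>cu\d+))?(?P<runtime>-runtime)?",
        tag,
    )
    if not m:
        return {"raw": tag}  # nightly stamps, rocm builds, etc.
    return {
        "release": m.group("release"),
        "cuda": m.group("cuda"),
        "runtime": m.group("runtime") is not None,
    }

print(parse_sglang_tag("v0.5.8-cu130-runtime"))
# → {'release': 'v0.5.8', 'cuda': 'cu130', 'runtime': True}
```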


Verify max-prefill-tokens aligns with benchmark input sequence length.

The max-prefill-tokens is set to 2048 (line 58), but the benchmark uses isl: 8192 (line 108). This 4× mismatch could cause prefill to reject or stall requests during benchmarking. Confirm whether this is intentional (e.g., chunked prefill handles overflow) or if max-prefill-tokens should be increased to at least 8192.

Regarding the container tag: all other H100 recipes in the repository consistently use lmsysorg/sglang:v0.5.8-cu130 (non-runtime), so this file is aligned with the existing codebase pattern.

🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml` around lines 1 - 6, The
recipe sets max-prefill-tokens to 2048 while the benchmark input sequence length
(isl) is 8192, causing a 4× mismatch that can reject or stall prefill; update
the max-prefill-tokens value (the YAML key max-prefill-tokens) to at least 8192
or implement/confirm chunked prefill behavior that safely handles isl=8192
(adjust associated prefill/chunking logic if present in the code paths that read
max-prefill-tokens and isl); verify the change by ensuring the configuration
(max-prefill-tokens) and the benchmark parameter (isl) in this recipe are
consistent.

Comment on lines +56 to +59
# Memory and token limits
mem-fraction-static: 0.6
max-prefill-tokens: 2048
chunked-prefill-size: 2048

⚠️ Potential issue | 🟠 Major

max-prefill-tokens is too low for the 8k ISL benchmark.

isl: 8192 exceeds max-prefill-tokens: 2048, so 8k inputs will likely be rejected or truncated. Bump the max to at least 8192 (chunk size can stay 2048).

✅ Suggested fix
-      max-prefill-tokens: 2048
+      max-prefill-tokens: 8192

Also applies to: 108-109

🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml` around lines 56 - 59,
The max-prefill-tokens value is too low for isl: 8192; update the
max-prefill-tokens setting (currently 2048) to at least 8192 so 8k inputs won't
be rejected or truncated, keep chunked-prefill-size at 2048, and make the same
change for the other max-prefill-tokens occurrence in this recipe; reference the
mem-fraction-static, max-prefill-tokens, and chunked-prefill-size entries when
locating the lines to edit.
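The check behind this comment can be sketched as a one-line validation. This is the conservative reading the reviewer proposes (it assumes chunked prefill does not raise the effective budget; hypothetical helper, not a real SGLang check):

```python
def check_prefill_budget(cfg: dict) -> bool:
    """Conservative check: the benchmark input sequence length (isl)
    must fit within max-prefill-tokens, or long inputs risk being
    rejected or stalled. chunked-prefill-size only sets chunking
    granularity, not the total budget, under this assumption."""
    return cfg["isl"] <= cfg["max-prefill-tokens"]

# Values from the recipe under review: 8k ISL vs a 2048-token budget
cfg = {"isl": 8192, "max-prefill-tokens": 2048, "chunked-prefill-size": 2048}
print(check_prefill_budget(cfg))  # False: 8k inputs exceed the 2048 budget
```

Bumping max-prefill-tokens to 8192, as suggested, makes the check pass while chunked-prefill-size stays at 2048.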

Comment on lines +54 to +57
# Memory and token limits
mem-fraction-static: 0.6
max-prefill-tokens: 2048
chunked-prefill-size: 2048

⚠️ Potential issue | 🟠 Major

Raise max-prefill-tokens to cover the 8k benchmark input.

benchmark.isl is 8192 while max-prefill-tokens is 2048, so 8k requests will be rejected or truncated.

✅ Proposed fix
-      max-prefill-tokens: 2048
+      max-prefill-tokens: 8192
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
      # Memory and token limits
      mem-fraction-static: 0.6
      max-prefill-tokens: 2048
      chunked-prefill-size: 2048

After:
      # Memory and token limits
      mem-fraction-static: 0.6
      max-prefill-tokens: 8192
      chunked-prefill-size: 2048
🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml` around lines 54 - 57, The
max-prefill-tokens setting is too low for the 8k benchmark (benchmark.isl =
8192); update the max-prefill-tokens value in the YAML (parameter name:
max-prefill-tokens) to at least 8192 (or higher if you want headroom) so 8k
prefill requests are not rejected or truncated, and keep chunked-prefill-size
aligned if needed.
