Add H100 FP8 SGLang Disaggregated Recipes for DSR1#121
Add H100 FP8 SGLang Disaggregated Recipes for DSR1#121ishandhanani merged 13 commits intoishandhanani:mainfrom
Conversation
📝 WalkthroughWalkthroughAdds 12 new H100 FP8 YAML deployment recipes covering 1k1k, 1k8k, and 8k1k input/output variants. Each file defines model metadata, per-phase resources and environment, separate prefill/decode sglang_config (parallelism, disaggregation, cache), optional MTP/EAGLE parameters, and benchmark settings. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`:
- Around line 58-61: The recipe's max-prefill-tokens is too low for 8k inputs
(benchmark.isl: 8192); update the max-prefill-tokens value in the h100 8k1k
recipe files (e.g., the mtp/stp YAMLs that define mem-fraction-static,
max-prefill-tokens, chunked-prefill-size) to at least 8192 so 8192-token
requests are accepted (also consider aligning chunked-prefill-size to 8192 or a
suitable chunking strategy to avoid premature truncation).
In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml`:
- Around line 56-59: The max-prefill-tokens setting is too low for the 8k
benchmark (benchmark.isl = 8192); update the value of max-prefill-tokens in the
recipe to be at least 8192 (e.g., set max-prefill-tokens: 8192) so prefill does
not reject or truncate 8k requests; verify related settings like
chunked-prefill-size if needed to support the larger prefill window.
🧹 Nitpick comments (1)
recipes/h100/1k8k/stp/h100-fp8-1p2d-max-tp.yaml (1)
39-39: Minor: Trailing whitespace.Line 39 has trailing whitespace after
1.🧹 Proposed fix
- ep-size: 1 + ep-size: 1
| # Memory and token limits | ||
| mem-fraction-static: 0.6 | ||
| max-prefill-tokens: 2048 | ||
| chunked-prefill-size: 2048 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "h100-fp8-1p1d-max-dep-mtp.yaml" -type fRepository: ishandhanani/srt-slurm
Length of output: 230
🏁 Script executed:
cat -n recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yamlRepository: ishandhanani/srt-slurm
Length of output: 3505
🏁 Script executed:
# Search for benchmark.isl settings in the same directory
grep -r "benchmark.isl" recipes/h100/8k1k/mtp/ || echo "Not found in mtp directory"Repository: ishandhanani/srt-slurm
Length of output: 92
🏁 Script executed:
# Search for benchmark.isl in the parent or related directories
find recipes/h100/8k1k -name "*.yaml" -exec grep -l "benchmark.isl" {} \;Repository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Check the 1k1k variant
sed -n '58,111p' recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yamlRepository: ishandhanani/srt-slurm
Length of output: 1341
🏁 Script executed:
# Check the 1k8k variant
sed -n '58,111p' recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yamlRepository: ishandhanani/srt-slurm
Length of output: 1341
🏁 Script executed:
# Search for documentation or usage comments about max-prefill-tokens
grep -r "max-prefill-tokens" . --include="*.yaml" --include="*.md" --include="*.py" -B2 -A2Repository: ishandhanani/srt-slurm
Length of output: 11690
🏁 Script executed:
find recipes/h100/8k1k -name "*.yaml" -type fRepository: ishandhanani/srt-slurm
Length of output: 267
🏁 Script executed:
# Check all h100 8k1k files for their benchmark isl values
for file in $(find recipes/h100/8k1k -name "*.yaml" -type f); do
echo "=== $file ==="
grep -A2 "benchmark:" "$file" | head -5
grep "max-prefill-tokens:" "$file"
doneRepository: ishandhanani/srt-slurm
Length of output: 591
Increase max-prefill-tokens to support 8k benchmark input.
The 8k1k recipe has benchmark.isl: 8192 but max-prefill-tokens: 2048, which will reject or truncate 8k requests. Update to at least 8192.
Proposed fix
max-prefill-tokens: 8192This affects all h100 8k1k recipe variants: both mtp and stp modes (4 files total).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Memory and token limits | |
| mem-fraction-static: 0.6 | |
| max-prefill-tokens: 2048 | |
| chunked-prefill-size: 2048 | |
| # Memory and token limits | |
| mem-fraction-static: 0.6 | |
| max-prefill-tokens: 8192 | |
| chunked-prefill-size: 2048 |
🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` around lines 58 - 61,
The recipe's max-prefill-tokens is too low for 8k inputs (benchmark.isl: 8192);
update the max-prefill-tokens value in the h100 8k1k recipe files (e.g., the
mtp/stp YAMLs that define mem-fraction-static, max-prefill-tokens,
chunked-prefill-size) to at least 8192 so 8192-token requests are accepted (also
consider aligning chunked-prefill-size to 8192 or a suitable chunking strategy
to avoid premature truncation).
| # Memory and token limits | ||
| mem-fraction-static: 0.6 | ||
| max-prefill-tokens: 2048 | ||
| chunked-prefill-size: 2048 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd "h100-fp8-1p1d-max-dep.yaml" --type fRepository: ishandhanani/srt-slurm
Length of output: 212
🏁 Script executed:
cat -n recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '50,110p'Repository: ishandhanani/srt-slurm
Length of output: 1807
🏁 Script executed:
rg "max-prefill-tokens" recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml -nRepository: ishandhanani/srt-slurm
Length of output: 99
🏁 Script executed:
rg "max-prefill-tokens" recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml -nRepository: ishandhanani/srt-slurm
Length of output: 231
🏁 Script executed:
rg "^ benchmark:" -A 4 recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yamlRepository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
cat -n recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml | tail -20Repository: ishandhanani/srt-slurm
Length of output: 672
🏁 Script executed:
cat -n recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml | tail -20Repository: ishandhanani/srt-slurm
Length of output: 672
Raise max-prefill-tokens to cover the 8k benchmark input.
benchmark.isl is 8192, but max-prefill-tokens is 2048, so prefill will reject or truncate 8k requests in this 8k1k recipe. Set it to >= 8192.
✅ Proposed fix
- max-prefill-tokens: 2048
+ max-prefill-tokens: 8192📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Memory and token limits | |
| mem-fraction-static: 0.6 | |
| max-prefill-tokens: 2048 | |
| chunked-prefill-size: 2048 | |
| # Memory and token limits | |
| mem-fraction-static: 0.6 | |
| max-prefill-tokens: 8192 | |
| chunked-prefill-size: 2048 |
🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml` around lines 56 - 59, The
max-prefill-tokens setting is too low for the 8k benchmark (benchmark.isl =
8192); update the value of max-prefill-tokens in the recipe to be at least 8192
(e.g., set max-prefill-tokens: 8192) so prefill does not reject or truncate 8k
requests; verify related settings like chunked-prefill-size if needed to support
the larger prefill window.
* update * Add H00 FP8 1k/8k configs * update * update * update * update * update * update * update h100 container and dg --------- Co-authored-by: Jinyan Chen <jinyanc@nvidia.com> Co-authored-by: ishandhanani <ishandhanani@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`:
- Around line 77-81: The DEP parallelism is misconfigured: keep tp-size: 16 and
enable-dp-attention: true but change dp-size from 16 to 2 and ep-size from 16 to
8 so the configuration matches a 2-node × 8-GPU per node layout (tp-size: 16,
dp-size: 2, ep-size: 8); update the YAML entries for tp-size, dp-size, and
ep-size accordingly.
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml`:
- Around line 3-6: Update the model container tag to the recommended runtime
variant by replacing the current container value "lmsysorg/sglang:v0.5.8-cu130"
with "lmsysorg/sglang:v0.5.8-cu130-runtime" in the model definition (look for
the container: field in the YAML where the string "lmsysorg/sglang:v0.5.8-cu130"
appears).
- Around line 36-40: Prefill parallelism currently sets tp-size: 16, dp-size: 1,
ep-size: 1, enable-dp-attention: false which mismatches the decode phase and
violates SGLang PD disaggregation best practices; update the prefill section to
use the same DP/EP and DP-attention settings as decode (e.g., set dp-size: 16,
ep-size: 16 and enable-dp-attention: true) so phase-homogeneous DP-attention
groups are maintained (adjust tp-size only if decode differs).
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml`:
- Around line 56-59: The max-prefill-tokens value is too low for isl: 8192;
update the max-prefill-tokens setting (currently 2048) to at least 8192 so 8k
inputs won't be rejected or truncated, keep chunked-prefill-size at 2048, and
make the same change for the other max-prefill-tokens occurrence in this recipe;
reference the mem-fraction-static, max-prefill-tokens, and chunked-prefill-size
entries when locating the lines to edit.
- Around line 1-6: The recipe sets max-prefill-tokens to 2048 while the
benchmark input sequence length (isl) is 8192, causing a 4× mismatch that can
reject or stall prefill; update the max-prefill-tokens value (the YAML key
max-prefill-tokens) to at least 8192 or implement/confirm chunked prefill
behavior that safely handles isl=8192 (adjust associated prefill/chunking logic
if present in the code paths that read max-prefill-tokens and isl); verify the
change by ensuring the configuration (max-prefill-tokens) and the benchmark
parameter (isl) in this recipe are consistent.
In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml`:
- Around line 54-57: The max-prefill-tokens setting is too low for the 8k
benchmark (benchmark.isl = 8192); update the max-prefill-tokens value in the
YAML (parameter name: max-prefill-tokens) to at least 8192 (or higher if you
want headroom) so 8k prefill requests are not rejected or truncated, and keep
chunked-prefill-size aligned if needed.
| # Parallelism | ||
| tp-size: 16 | ||
| dp-size: 16 | ||
| ep-size: 16 | ||
| enable-dp-attention: true |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
SGLang DEP Data Expert Parallelism dp-attention enable configuration
💡 Result:
To enable DEP (Data + Expert Parallelism) in SGLang for MoE models (e.g., DeepSeek V3/R1), you typically combine:
- DP-Attention (data-parallel attention to reduce KV-cache duplication)
- EP (expert parallelism for MoE experts)
1) Enable DP-Attention (dp-attention)
Add --enable-dp-attention and set --tp and --dp:
-
Single node (example: 8 GPUs):
python -m sglang.launch_server \ --model-path deepseek-ai/DeepSeek-V3 \ --tp 8 --dp 8 --enable-dp-attention
(Common high-throughput config; DP attention helps especially at large batch sizes.) [1]
-
Multi-node hybrid example (2 nodes × 8 GPUs = 16 GPUs total):
python -m sglang.launch_server \ --model-path deepseek-ai/DeepSeek-V3 \ --tp 16 --dp 2 --enable-dp-attention
This means 2 DP groups, each with 8 GPUs for attention TP. [1]
Notes:
- DP-attention is not recommended for low-latency / small-batch serving. [1]
- Some docs also state a constraint that “dp size should be equal to tp size” for
--enable-dp-attention; treat this as a model/feature-specific limitation and prefer the per-model DeepSeek guidance above when applicable. [1][2]
2) Enable Expert Parallelism (EP) for MoE
Set --ep and (optionally) choose EP backends:
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 8 --ep 8 \
--moe-a2a-backend deepep \
--moe-runner-backend deep_gemm(Example from SGLang EP docs.) [3]
Important EP constraint:
- For
deepep/mooncake/ some other A2A backends, SGLang currently supportsep_size = tp_size(within the expert-parallel group). [3]
3) Putting it together (DEP = DP-attention + EP) on 16 GPUs (2×8)
A common pattern is:
--tp 16 --dp 2 --enable-dp-attention(attention uses 2 DP groups × 8-way TP) [1]- EP runs within each DP group, so set
--ep 8and use EP backends that requireep=tp_per_group(8 here). [3]
Skeleton:
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 16 --dp 2 --enable-dp-attention \
--ep 8 --moe-a2a-backend deepep --moe-runner-backend deep_gemm4) Extra: DP-attention recommended defaults for DeepSeek V3.2
For DeepSeek V3.2, SGLang notes kernels are tuned around dp_size=8, recommending:
--dp 8 --enable-dp-attentionfor stability/perf in their tests. [4]
Sources: [1] DeepSeek V3/V3.1/R1 Usage (SGLang docs), [2] Server Arguments (SGLang docs), [3] Expert Parallelism (SGLang docs), [4] DeepSeek V3.2 Usage (SGLang docs.
Correct DEP parallelism settings for 16-GPU allocation.
The decode configuration uses tp-size: 16, dp-size: 16, ep-size: 16 on 16 GPUs, but this deviates from SGLang's documented DEP pattern. For a 2-node × 8-GPU setup with enable-dp-attention, SGLang's recommended configuration is:
tp: 16, dp: 2(2 DP groups, each with 8-way TP)ep: 8(expert parallelism within each DP group)
With 16 total GPUs, dp=16 is not feasible—it would require 16 model replicas across 16 GPUs. Change to dp-size: 2 and ep-size: 8 to align with SGLang's DEP semantics and DeepSeek's tuned defaults.
🤖 Prompt for AI Agents
In `@recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` around lines 77 - 81,
The DEP parallelism is misconfigured: keep tp-size: 16 and enable-dp-attention:
true but change dp-size from 16 to 2 and ep-size from 16 to 8 so the
configuration matches a 2-node × 8-GPU per node layout (tp-size: 16, dp-size: 2,
ep-size: 8); update the YAML entries for tp-size, dp-size, and ep-size
accordingly.
| model: | ||
| path: "dsfp8" | ||
| container: "lmsysorg/sglang:v0.5.8-cu130" | ||
| precision: "fp8" |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
lmsysorg/sglang v0.5.8 docker image tags runtime cu130
💡 Result:
For SGLang v0.5.8 with the CUDA 13.0 (“cu130”) runtime variant, the tag you want is:
lmsysorg/sglang:v0.5.8-cu130-runtime(pinned to the v0.5.8 release) [3]
Related / commonly used aliases:
lmsysorg/sglang:latest-cu130-runtime(moving “stable” CUDA 13.0 runtime tag recommended in the install docs) [1]
Tag naming pattern confirmation (example from the previous release):
lmsysorg/sglang:v0.5.7-cu130-runtimeexists on Docker Hub mirrors, matching thevX.Y.Z-cu130-runtimeconvention [2]
Sources: SGLang install docs (CUDA 13.0 runtime recommendation) [1], Docker image metadata for v0.5.7-cu130-runtime [2], SGLang v0.5.8 release tag (confirms v0.5.8 exists) [3].
Update container tag to the recommended runtime variant.
Line 5 uses v0.5.8-cu130, but the recommended tag for this release is lmsysorg/sglang:v0.5.8-cu130-runtime. The runtime variant is smaller and is the standard recommendation in the SGLang install documentation.
Proposed fix
- container: "lmsysorg/sglang:v0.5.8-cu130"
+ container: "lmsysorg/sglang:v0.5.8-cu130-runtime"🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` around lines 3 - 6,
Update the model container tag to the recommended runtime variant by replacing
the current container value "lmsysorg/sglang:v0.5.8-cu130" with
"lmsysorg/sglang:v0.5.8-cu130-runtime" in the model definition (look for the
container: field in the YAML where the string "lmsysorg/sglang:v0.5.8-cu130"
appears).
| # Parallelism | ||
| tp-size: 16 | ||
| dp-size: 1 | ||
| ep-size: 1 | ||
| enable-dp-attention: false |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "h100-fp8-1p1d-max-dep-mtp.yaml" 2>/dev/nullRepository: ishandhanani/srt-slurm
Length of output: 230
🏁 Script executed:
# Search for recipe files in h100 directory to understand structure
fd "max-dep" --type f 2>/dev/null | head -20Repository: ishandhanani/srt-slurm
Length of output: 371
🏁 Script executed:
# Look at the specific file mentioned in the review
cat -n recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml 2>/dev/null || find . -path "*recipes*max-dep*" -type f 2>/dev/nullRepository: ishandhanani/srt-slurm
Length of output: 3357
🏁 Script executed:
# Check the 1k8k max-dep-mtp variant
cat -n recipes/h100/1k8k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml | sed -n '36,81p'Repository: ishandhanani/srt-slurm
Length of output: 1495
🏁 Script executed:
# Check the 1k1k max-dep-mtp variant
cat -n recipes/h100/1k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml | sed -n '36,81p'Repository: ishandhanani/srt-slurm
Length of output: 1489
🏁 Script executed:
# Check the 8k1k stp (non-MTP) max-dep variant
cat -n recipes/h100/8k1k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '36,81p'Repository: ishandhanani/srt-slurm
Length of output: 1430
🏁 Script executed:
# Check the 1k8k stp variant
cat -n recipes/h100/1k8k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '36,81p'Repository: ishandhanani/srt-slurm
Length of output: 1430
🏁 Script executed:
# Check the 1k1k stp variant
cat -n recipes/h100/1k1k/stp/h100-fp8-1p1d-max-dep.yaml | sed -n '36,81p'Repository: ishandhanani/srt-slurm
Length of output: 1430
🏁 Script executed:
# Search for any documentation or comments about prefill/decode parallelism strategy
rg -i "prefill.*parallelism|tp-only|disaggregat" --type yaml -A 2 -B 2 recipes/ | head -50Repository: ishandhanani/srt-slurm
Length of output: 3667
🏁 Script executed:
# Look for README or documentation files that explain the recipe design
find . -type f \( -name "README*" -o -name "*.md" \) -exec grep -l "max-dep\|prefill\|decode" {} \; 2>/dev/null | head -10Repository: ishandhanani/srt-slurm
Length of output: 268
🏁 Script executed:
# Check CLAUDE.md for relevant information
cat CLAUDE.md | grep -i -A 5 -B 5 "max-dep\|prefill.*parallelism\|decode.*parallelism" | head -100Repository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Check config reference for parallelism guidance
cat docs/config-reference.md | grep -i -A 3 -B 3 "dp-size\|enable-dp-attention\|disaggregat" | head -80Repository: ishandhanani/srt-slurm
Length of output: 3109
🏁 Script executed:
# Check architecture documentation
cat docs/architecture.md | grep -i -A 5 -B 5 "prefill\|decode\|disaggregat" | head -100Repository: ishandhanani/srt-slurm
Length of output: 3553
🏁 Script executed:
# Search for any comments or documentation about why prefill uses TP-only
rg "TP.*only|prefill.*parallelism|decode.*parallelism" --type yaml --type md -B 2 -A 2 | head -80Repository: ishandhanani/srt-slurm
Length of output: 476
🏁 Script executed:
# Check if there are any non-max-dep recipes with different prefill/decode parallelism patterns
fd "\.yaml$" recipes/ -x grep -l "dp-size.*16" {} \; | head -20Repository: ishandhanani/srt-slurm
Length of output: 399
🏁 Script executed:
# Look for any references to prefill-specific optimization or reasoning
find . -type f -name "*.md" -o -name "*.py" | xargs grep -i "prefill.*parallelism\|why.*prefill\|tp.*only" 2>/dev/null | head -30Repository: ishandhanani/srt-slurm
Length of output: 725
🏁 Script executed:
# Check if there's a blog post that explains the max-dep strategy
cat recipes/h100/blog-post.yaml | head -100Repository: ishandhanani/srt-slurm
Length of output: 2659
🏁 Script executed:
# Look for any issue tracker or comments discussing prefill parallelism strategy
find . -type f \( -name "*.md" -o -name "*.txt" \) -exec grep -l "prefill.*parallelism\|prefill.*tp.*only\|decode.*expert" {} \; 2>/dev/nullRepository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Check git history or comments for when max-dep recipes were created
git log --oneline recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml 2>/dev/null | head -10Repository: ishandhanani/srt-slurm
Length of output: 102
🏁 Script executed:
# Search for any discussion or rationale in commit messages or issues
find . -type f -name "*.md" -o -name "*.txt" -o -name "NOTES*" | xargs grep -i "max-dep\|prefill.*tp.*only" 2>/dev/nullRepository: ishandhanani/srt-slurm
Length of output: 50
🏁 Script executed:
# Compare the blog-post config with max-dep recipes more carefully
echo "=== BLOG POST PREFILL ===" && cat recipes/h100/blog-post.yaml | sed -n '48,65p' && echo -e "\n=== MAX-DEP PREFILL ===" && cat recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml | sed -n '36,40p'Repository: ishandhanani/srt-slurm
Length of output: 713
🏁 Script executed:
# Check when max-dep recipes were created vs blog-post
git log --oneline recipes/h100/blog-post.yaml 2>/dev/null | head -5Repository: ishandhanani/srt-slurm
Length of output: 102
🏁 Script executed:
# Look for any SGLang documentation or issues about prefill parallelism strategy
rg "prefill.*expert\|expert.*parallel.*prefill\|disable.*dp.*attention" --type md --type yaml -B 2 -A 2 | head -60Repository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Check if there's any README in the h100 recipes directory
cat recipes/h100/README.md 2>/dev/null || find recipes/h100 -maxdepth 1 -name "*.md" -o -name "*.txt"Repository: ishandhanani/srt-slurm
Length of output: 48
🌐 Web query:
SGLang disaggregated prefill decode parallelism strategy dp-size expert parallel
💡 Result:
In SGLang PD (Prefill/Decode) disaggregation, the core idea is to separate the compute-bound prefill stage from the memory-/KV-cache-bound decode stage to avoid (1) decode being interrupted by prefill and (2) “DP attention imbalance” where different DP workers end up doing different phases at the same time, increasing decode latency. [1]
How to think about the parallelism knobs
1) --dp-size (with --enable-dp-attention)
- In PD mode, you typically want prefill and decode to each run as a “homogeneous” group, so that within a DP-attention group everyone is doing prefill (on prefill workers) or decode (on decode workers), instead of mixing phases (that’s the imbalance PD is meant to eliminate). [1]
- In SGLang’s DP-attention setup (as documented for DeepSeek), the attention tensor-parallel group size is effectively:
atten_tp_size = tp_size / dp_size. [2]
This matters because it changes the “shape” of the attention parallel group and any features that reuse it (e.g., some context-parallel setups).
2) Prefill-side strategy (compute-bound)
- Favor parallelism that improves throughput on large prompt batches / long prompts, e.g.:
- higher TP and/or PP + chunked prefill (pipeline parallelism for long context).
- SGLang’s PP work explicitly targets long-context TTFT and is designed to be compatible with PD disaggregation, and the roadmap notes configurations with different TP degrees on prefill vs decode. [3]
3) Decode-side strategy (KV/memory-bound)
- Scale decode capacity primarily by adding more decode workers (horizontal scale) and by using MoE-friendly parallelism when applicable.
- For MoE models, SGLang recommends using Expert Parallelism (EP) to distribute expert weights and reduce MoE memory bottlenecks. [4]
- If you use DeepEP/Mooncake for MoE all-to-all (
--moe-a2a-backend deepep|mooncake), note the current constraint: these backends only supportep_size = tp_size(hybridep_size < tp_sizerequires thenonebackend). [4]
A concrete, “known-good” PD + DP-attention + MoE pattern (DeepSeek example)
SGLang’s PD disaggregation docs show multi-node DeepSeek workers launched with the same DP-attention shape on both sides (example uses --tp-size 16 --dp-size 8 --enable-dp-attention) and MoE all-to-all via DeepEP (--moe-a2a-backend deepep). [1]
Rule of thumb:
- Prefill workers: optimize for compute/TTFT (TP/PP/chunking), keep DP-attention groups phase-homogeneous.
- Decode workers: optimize for KV-cache / MoE memory & latency; for MoE, prefer EP, and if using DeepEP/Mooncake keep
ep == tp. [1][4]
Sources: [1] PD disaggregation docs, incl. DP-attention imbalance + DeepSeek PD launch example. [2] DeepSeek V3.2 docs (atten_tp_size = tp_size / dp_size). [3] SGLang PP roadmap issue (PP compatible with PD, differing TP degrees prefill vs decode). [4] SGLang expert parallelism docs (DeepEP/Mooncake constraint ep_size = tp_size, backend flags).
Prefill parallelism settings are misaligned with SGLang PD disaggregation best practices.
The current prefill config uses TP-only (dp-size: 1, ep-size: 1, enable-dp-attention: false) while decode uses full DEP (dp-size: 16, ep-size: 16, enable-dp-attention: true). SGLang's documented PD disaggregation pattern requires prefill and decode to maintain phase-homogeneous DP-attention groups to avoid attention imbalance. The blog-post config in this repository and SGLang's official guidance both use matching DP settings across both phases (e.g., dp-size: 16 on both).
Recommended fix
- dp-size: 1
- ep-size: 1
- enable-dp-attention: false
+ dp-size: 16
+ ep-size: 16
+ enable-dp-attention: true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Parallelism | |
| tp-size: 16 | |
| dp-size: 1 | |
| ep-size: 1 | |
| enable-dp-attention: false | |
| # Parallelism | |
| tp-size: 16 | |
| dp-size: 16 | |
| ep-size: 16 | |
| enable-dp-attention: true |
🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-dep-mtp.yaml` around lines 36 - 40,
Prefill parallelism currently sets tp-size: 16, dp-size: 1, ep-size: 1,
enable-dp-attention: false which mismatches the decode phase and violates SGLang
PD disaggregation best practices; update the prefill section to use the same
DP/EP and DP-attention settings as decode (e.g., set dp-size: 16, ep-size: 16
and enable-dp-attention: true) so phase-homogeneous DP-attention groups are
maintained (adjust tp-size only if decode differs).
| name: "h100-fp8-1p1d-max-tp-mtp" | ||
|
|
||
| model: | ||
| path: "dsfp8" | ||
| container: "lmsysorg/sglang:v0.5.8-cu130" | ||
| precision: "fp8" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's find and explore other H100 recipes to see which container tag pattern is used
find recipes -name "*.yaml" -path "*/h100/*" | head -20Repository: ishandhanani/srt-slurm
Length of output: 699
🏁 Script executed:
# Check the full content of the file in question
cat -n recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yamlRepository: ishandhanani/srt-slurm
Length of output: 3385
🏁 Script executed:
# Search for container tag patterns across all H100 recipes
rg "container:" recipes/h100/ -A 0 | head -20Repository: ishandhanani/srt-slurm
Length of output: 1254
🌐 Web query:
SGLang sglang v0.5.8 cu130 runtime image differences lmsysorg/sglang container tags
💡 Result:
How to read lmsysorg/sglang tags (with focus on v0.5.8 + cu130 + runtime)
1) *-runtime vs non-runtime
...-runtimeimages are intended for production and are significantly smaller (~40% reduction) because they exclude build tools and development dependencies. (docs.sglang.io)- Non-
runtimeimages (e.g.,latest) are “full” images that include more dev/build tooling (better for debugging or building extra deps inside the container). (docs.sglang.io)
2) What cu130 means
cu130means the image is built for CUDA 13 (cu130). SGLang docs explicitly recommend usinglmsysorg/sglang:latest-cu130-runtimefor CUDA 13 environments (anddev-cu13for nightly/dev). (docs.sglang.io)- In practice, the CUDA-13 runtime stack in these images includes CUDA 13.0.x user-space libraries (and requires a compatible NVIDIA driver on the host, as usual for CUDA containers).
3) What’s inside a cu130-runtime image (example from the official lineage)
A publicly indexed copy of lmsysorg/sglang:v0.5.7-cu130-runtime shows the typical CUDA-13 runtime base/contents used for these tags, including:
- CUDA_VERSION=13.0.1
- Ubuntu 24.04 base
- Python 3.12
- cuDNN 9.13.x (CUDA 13 build) (docker.aityp.com)
This is the cu130-runtime “family” you should expectv0.5.8-cu130-runtimeto follow as well (same intent and very similar base stack), even if the exact patch versions can vary across builds.
4) dev-cu13 / nightly CUDA-13 tags
SGLang’s CUDA-13 docker workflow builds and publishes these tags:
lmsysorg/sglang:dev-x86-cu13lmsysorg/sglang:dev-arm64-cu13- a multi-arch manifest
lmsysorg/sglang:dev-cu13 - plus nightly stamp tags like
nightly-dev-cu13-YYYYMMDD-<sha>(github.com)
These track the latest code rather than a fixed release likev0.5.8.
5) Version-pinned tags
- Tags starting with
v0.5.8-...are pinned to the v0.5.8 release line (for example, docs referencelmsysorg/sglang:v0.5.8-rocm700-mi30xas a v0.5.8-specific build for that platform). (docs.sglang.io)
If you tell me which tags you’re comparing against (e.g., v0.5.8-cu130 vs v0.5.8-cu130-runtime vs latest-cu130-runtime), I can summarize the differences in a small table.
Citations:
- 1: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 2: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 3: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 4: https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime?utm_source=openai
- 5: https://github.com/sgl-project/sglang/actions/runs/21652742322/workflow
- 6: https://docs.sglang.io/basic_usage/deepseek_v32.html?utm_source=openai
Verify max-prefill-tokens aligns with benchmark input sequence length.
The max-prefill-tokens is set to 2048 (line 58), but the benchmark uses isl: 8192 (line 108). This 4× mismatch could cause prefill to reject or stall requests during benchmarking. Confirm whether this is intentional (e.g., chunked prefill handles overflow) or if max-prefill-tokens should be increased to at least 8192.
Regarding the container tag: all other H100 recipes in the repository consistently use lmsysorg/sglang:v0.5.8-cu130 (non-runtime), so this file is aligned with the existing codebase pattern.
🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml` around lines 1 - 6, The
recipe sets max-prefill-tokens to 2048 while the benchmark input sequence length
(isl) is 8192, causing a 4× mismatch that can reject or stall prefill; update
the max-prefill-tokens value (the YAML key max-prefill-tokens) to at least 8192
or implement/confirm chunked prefill behavior that safely handles isl=8192
(adjust associated prefill/chunking logic if present in the code paths that read
max-prefill-tokens and isl); verify the change by ensuring the configuration
(max-prefill-tokens) and the benchmark parameter (isl) in this recipe are
consistent.
| # Memory and token limits | ||
| mem-fraction-static: 0.6 | ||
| max-prefill-tokens: 2048 | ||
| chunked-prefill-size: 2048 |
There was a problem hiding this comment.
max-prefill-tokens is too low for the 8k ISL benchmark.
isl: 8192 exceeds max-prefill-tokens: 2048, so 8k inputs will likely be rejected or truncated. Bump the max to at least 8192 (chunk size can stay 2048).
✅ Suggested fix
- max-prefill-tokens: 2048
+ max-prefill-tokens: 8192Also applies to: 108-109
🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/mtp/h100-fp8-1p1d-max-tp-mtp.yaml` around lines 56 - 59,
The max-prefill-tokens value is too low for isl: 8192; update the
max-prefill-tokens setting (currently 2048) to at least 8192 so 8k inputs won't
be rejected or truncated, keep chunked-prefill-size at 2048, and make the same
change for the other max-prefill-tokens occurrence in this recipe; reference the
mem-fraction-static, max-prefill-tokens, and chunked-prefill-size entries when
locating the lines to edit.
| # Memory and token limits | ||
| mem-fraction-static: 0.6 | ||
| max-prefill-tokens: 2048 | ||
| chunked-prefill-size: 2048 |
There was a problem hiding this comment.
Raise max-prefill-tokens to cover the 8k benchmark input.
benchmark.isl is 8192 while max-prefill-tokens is 2048, so 8k requests will be rejected or truncated.
✅ Proposed fix
- max-prefill-tokens: 2048
+ max-prefill-tokens: 8192📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Memory and token limits | |
| mem-fraction-static: 0.6 | |
| max-prefill-tokens: 2048 | |
| chunked-prefill-size: 2048 | |
| # Memory and token limits | |
| mem-fraction-static: 0.6 | |
| max-prefill-tokens: 8192 | |
| chunked-prefill-size: 2048 |
🤖 Prompt for AI Agents
In `@recipes/h100/8k1k/stp/h100-fp8-1p1d-max-tp.yaml` around lines 54 - 57, The
max-prefill-tokens setting is too low for the 8k benchmark (benchmark.isl =
8192); update the max-prefill-tokens value in the YAML (parameter name:
max-prefill-tokens) to at least 8192 (or higher if you want headroom) so 8k
prefill requests are not rejected or truncated, and keep chunked-prefill-size
aligned if needed.
Add H100 FP8 SGLang Disaggregated Recipes for DeepSeek-R1
Recipes Added
1k1k
8k1k
1k8k
Key Settings
lmsysorg/sglang:v0.5.8-cu130-runtimelmsysorg/sglang:v0.5.8-cu130-runtimeSummary by CodeRabbit
Chores
New Features