Add all GB200/GB300 FP8 MTP recipes (#134)
Conversation
📝 Walkthrough
Adds many new GB300 and GB200 FP8 deployment recipe YAMLs and updates one GB200 YAML: configuration-only changes for model/runtime images, frontend, resources, per-stage prefill/decode envs, extensive sglang_config tuning (DeepEP/MTP/disaggregation), CUDA-graph lists, and benchmark stanzas.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml`:
- Around line 1-19: Add a missing slurm section to this recipe so it matches
other recipes (e.g., 1k1k/mtp/mid.yaml): inside the root document add a
top-level "slurm:" mapping containing the same keys used in other recipes (for
example partition, gres/node, cpus_per_task, and any node-specific settings) to
mirror the structure used alongside "model", "frontend", and "resources"; ensure
the section is named "slurm" and placed at root level so tools reading "name",
"model", "frontend", and "resources" will find it consistently.
In `@recipes/gb300-fp8/1k1k/mtp/mid.yaml`:
- Around line 1-19: The recipe is missing a slurm block which other recipes
include; add a slurm section after the existing resources block (same file's
top-level keys: name, model, frontend, resources) and include at minimum
time_limit: "04:00:00" (you can also add other standard keys like partition or
gres if required by your deployment); ensure the new slurm key is top-level and
follows the same YAML indentation/style as the existing blocks so the parser
recognizes it.
In `@recipes/gb300-fp8/8k1k/mtp/max.yaml`:
- Around line 13-19: The configured tensor/data/expert parallelism (tp, dp,
ep) massively oversubscribes GPUs: compute total_gpus = prefill_nodes *
gpus_per_node for prefill and decode_nodes * gpus_per_node for decode, then
ensure tp * dp * ep <= total_gpus (or increase nodes/gpus accordingly). Update
the tp/dp/ep values in the prefill and decode sections (and mirror the same
changes in the duplicate blocks referenced around lines 71-75 and 131-134) so
the product of tp×dp×ep does not exceed available GPUs, or alternatively
increase prefill_nodes/decode_nodes or gpus_per_node to match the desired
parallelism. Ensure consistency between prefill_workers/decode_workers and the
adjusted parallelism so world_size calculations in SGLang remain correct.
- Around line 67-69: The recipe sets served-model-name "deepseek-ai/DeepSeek-R1"
with trust-remote-code: true without a pinned revision; update the configuration
to either pin the model reference (e.g., append @<commit-hash> to the
served-model-name), or change the source to an internal mirror, or add a clear
security rationale comment in the recipe explaining why trust-remote-code is
acceptable and what vetting was performed; apply the same fix for any other
blocks that set trust-remote-code alongside served-model-name in this recipe.
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml`:
- Around line 67-70: The recipe references the remote repo
"deepseek-ai/DeepSeek-R1" with trust-remote-code: true which allows arbitrary
remote code execution; update both occurrences of the served model reference in
the prefill and decode sections (the "served-model-name" fields) to pin to a
specific commit or tag (e.g., "deepseek-ai/DeepSeek-R1@<commit_sha>") or switch
to an internally mirrored model ID, and if you intentionally must keep an
unpinned reference, add a short documented rationale and note any compensating
controls (sandboxing, code review process) adjacent to the served-model-name and
trust-remote-code settings so reviewers can verify the risk mitigation.
- Around line 1-12: The container tag specified in model.container
("lmsysorg/sglang:v0.5.8-cu130-runtime") appears invalid; update the tag to a
known-public image (e.g., "lmsysorg/sglang:latest-cu130-runtime" or
"lmsysorg/sglang:v0.5.7-cu130-runtime") or confirm and document that
"v0.5.8-cu130-runtime" exists in your private registry; change the
model.container value accordingly and ensure any deployment/configuration docs
reference the chosen tag.
🧹 Nitpick comments (5)
recipes/gb300-fp8/1k8k/mtp/max.yaml (1)
175-176: `cuda-graph-bs` list contains batch sizes exceeding `cuda-graph-max-bs`. The `cuda-graph-bs` list includes values up to 1024, but `cuda-graph-max-bs` is set to 512. Batch sizes beyond 512 in the list (544, 576, 608, 640, 672, 704, 736, 768, 1024) will not be used for CUDA graph capture. Consider removing these unused values to improve clarity, or adjust `cuda-graph-max-bs` if larger batch sizes are intended.

recipes/gb300-fp8/1k8k/mtp/mid.yaml (1)
175-176: `cuda-graph-bs` list contains unused batch sizes. Same issue as max.yaml: the list includes batch sizes up to 1024 while `cuda-graph-max-bs` is 512.

recipes/gb300-fp8/1k1k/mtp/max.yaml (1)
172-173: `cuda-graph-bs` list contains unused batch sizes. Same issue as other max/mid configs: the list includes batch sizes up to 1024 while `cuda-graph-max-bs` is 512. Consider trimming the list or adjusting max-bs.

recipes/gb300-fp8/8k1k/mtp/max.yaml (1)
23-63: Confirm the DG cache directory is mounted for both stages. Both prefill and decode set `SGLANG_DG_CACHE_DIR: /configs/dg-10212025`; please ensure this path is present and mounted into the container on all prefill/decode nodes to avoid startup failures or cache misses.

recipes/gb300-fp8/8k1k/mtp/mid.yaml (1)
23-63: Confirm the DG cache directory is mounted for both stages. Both prefill and decode set `SGLANG_DG_CACHE_DIR: /configs/dg-10212025`; please ensure this path is present and mounted into the container on all prefill/decode nodes to avoid startup failures or cache misses.
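The trim these nitpicks suggest is mechanical; a minimal sketch (the helper name is ours, not part of the repo tooling):

```python
def trim_cuda_graph_bs(bs_list, max_bs):
    """Keep only batch sizes that CUDA graph capture can actually use (bs <= max_bs)."""
    return [bs for bs in bs_list if bs <= max_bs]

# The flagged recipes list sizes up to 1024 while cuda-graph-max-bs is 512;
# everything above 512 is dead config.
print(trim_cuda_graph_bs([256, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024], 512))
# [256, 512]
```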
```yaml
# GB300 FP8 Max Throughput Configuration

name: "gb300-1k1k-fp8-max"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

frontend:
  nginx_container: nginx

resources:
  gpu_type: "gb300"
  prefill_nodes: 2
  prefill_workers: 1
  decode_nodes: 2
  decode_workers: 1
  gpus_per_node: 4
```
Missing slurm section.
Similar to 1k1k/mtp/mid.yaml, this file is missing the slurm section. Please add it for consistency with other recipe files.
Suggested fix

```diff
 resources:
   gpu_type: "gb300"
   prefill_nodes: 2
   prefill_workers: 1
   decode_nodes: 2
   decode_workers: 1
   gpus_per_node: 4
+slurm:
+  time_limit: "04:00:00"
+
 backend:
```

🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml` around lines 1 - 19, Add a missing slurm
section to this recipe so it matches other recipes (e.g., 1k1k/mtp/mid.yaml):
inside the root document add a top-level "slurm:" mapping containing the same
keys used in other recipes (for example partition, gres/node, cpus_per_task, and
any node-specific settings) to mirror the structure used alongside "model",
"frontend", and "resources"; ensure the section is named "slurm" and placed at
root level so tools reading "name", "model", "frontend", and "resources" will
find it consistently.
```yaml
# GB300 FP8 Mid Throughput Configuration
name: "gb300-1k1k-fp8-mid"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

frontend:
  nginx_container: nginx

resources:
  gpu_type: "gb300"
  prefill_nodes: 4
  prefill_workers: 2
  decode_nodes: 8
  decode_workers: 1
  gpus_per_node: 4
```
Missing slurm section.
This configuration file is missing the slurm section that other recipe files include (e.g., time_limit: "04:00:00"). This may cause issues or use unexpected defaults when the recipe is deployed.
Suggested fix
Add the slurm section after the resources block:
```diff
 resources:
   gpu_type: "gb300"
   prefill_nodes: 4
   prefill_workers: 2
   decode_nodes: 8
   decode_workers: 1
   gpus_per_node: 4
+slurm:
+  time_limit: "04:00:00"
+
 backend:
```

🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/mid.yaml` around lines 1 - 19, The recipe is
missing a slurm block which other recipes include; add a slurm section after the
existing resources block (same file's top-level keys: name, model, frontend,
resources) and include at minimum time_limit: "04:00:00" (you can also add other
standard keys like partition or gres if required by your deployment); ensure the
new slurm key is top-level and follows the same YAML indentation/style as the
existing blocks so the parser recognizes it.
```yaml
resources:
  gpu_type: "gb300"
  prefill_nodes: 12
  prefill_workers: 6
  decode_nodes: 6
  decode_workers: 1
  gpus_per_node: 4
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# First, find and read the target file
if [ -f "recipes/gb300-fp8/8k1k/mtp/max.yaml" ]; then
  echo "=== File exists, reading content ==="
  cat -n recipes/gb300-fp8/8k1k/mtp/max.yaml
else
  echo "=== File not found, searching for similar patterns ==="
  find . -name "*.yaml" -path "*/gb300*" -type f 2>/dev/null | head -20
fi
```

Repository: ishandhanani/srt-slurm
Length of output: 6697
🏁 Script executed:
```bash
#!/bin/bash
# Check if recipes/gb300-fp8 exists and list yaml files
if [ -d "recipes/gb300-fp8" ]; then
  echo "=== Directory structure ==="
  find recipes/gb300-fp8 -name "*.yaml" -type f | head -20
else
  echo "=== Searching for gb300 recipes ==="
  find . -type d -name "*gb300*" 2>/dev/null
fi
```

Repository: ishandhanani/srt-slurm
Length of output: 789
🏁 Script executed:
```bash
#!/bin/bash
# Verify tp/dp/ep sizing vs available GPUs across gb300 recipes (assumes world_size = tp*dp*ep).
python3 <<'PY'
from pathlib import Path
import yaml

def world(cfg):
    tp = int(cfg.get("tp-size", 1))
    dp = int(cfg.get("dp-size", 1))
    ep = int(cfg.get("ep-size", 1))
    return tp * dp * ep, tp, dp, ep

for path in sorted(Path("recipes/gb300-fp8").rglob("*.yaml")):
    try:
        data = yaml.safe_load(path.read_text())
        res = data.get("resources", {})
        gpn = res.get("gpus_per_node")
        pn = res.get("prefill_nodes")
        dn = res.get("decode_nodes")
        prefill_gpus = gpn * pn if gpn and pn else None
        decode_gpus = gpn * dn if gpn and dn else None
        sg = data.get("backend", {}).get("sglang_config", {})
        pre = sg.get("prefill", {})
        dec = sg.get("decode", {})
        if pre and prefill_gpus:
            w, tp, dp, ep = world(pre)
            if w > prefill_gpus:
                print(f"{path.name}: prefill world_size {w} > prefill_gpus {prefill_gpus} (tp={tp},dp={dp},ep={ep})")
        if dec and decode_gpus:
            w, tp, dp, ep = world(dec)
            if w > decode_gpus:
                print(f"{path.name}: decode world_size {w} > decode_gpus {decode_gpus} (tp={tp},dp={dp},ep={ep})")
    except Exception:
        pass
PY
```

Repository: ishandhanani/srt-slurm
Length of output: 1717
Parallelism factors exceed available GPUs by orders of magnitude.
Prefill: 48 GPUs (12 nodes × 4) configured with tp/dp/ep = 8/8/8 = 512 ranks (10.7× oversubscription). Decode: 24 GPUs (6 nodes × 4) configured with tp/dp/ep = 24/24/24 = 13,824 ranks (576× oversubscription). If SGLang uses world_size = tp × dp × ep, this configuration will fail to launch. Reduce tp/dp/ep values or scale GPU resources accordingly.
Also applies to lines 71-75 (prefill parallelism) and 131-134 (decode parallelism).
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/max.yaml` around lines 13 - 19, The configured
tensor/data/expert parallelism (tp, dp, ep) massively oversubscribes GPUs:
compute total_gpus = prefill_nodes * gpus_per_node for prefill and decode_nodes
* gpus_per_node for decode, then ensure tp * dp * ep <= total_gpus (or increase
nodes/gpus accordingly). Update the tp/dp/ep values in the prefill and decode
sections (and mirror the same changes in the duplicate blocks referenced around
lines 71-75 and 131-134) so the product of tp×dp×ep does not exceed available
GPUs, or alternatively increase prefill_nodes/decode_nodes or gpus_per_node to
match the desired parallelism. Ensure consistency between
prefill_workers/decode_workers and the adjusted parallelism so world_size
calculations in SGLang remain correct.
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Script executed:
```bash
find . -name "max.yaml" -path "*/gb300-fp8/8k1k/mtp/*" 2>/dev/null | head -20
```
Repository: ishandhanani/srt-slurm
Length of output: 103

🏁 Script executed:
```bash
wc -l ./recipes/gb300-fp8/8k1k/mtp/max.yaml
```
Repository: ishandhanani/srt-slurm
Length of output: 107

🏁 Script executed:
```bash
cat -n ./recipes/gb300-fp8/8k1k/mtp/max.yaml | sed -n '60,135p'
```
Repository: ishandhanani/srt-slurm
Length of output: 2539

🏁 Script executed:
```bash
grep -r "revision\|DeepSeek" ./recipes/gb300-fp8/8k1k/mtp/ 2>/dev/null
```
Repository: ishandhanani/srt-slurm
Length of output: 615

🏁 Script executed:
```bash
grep -r "trust-remote-code" ./recipes/gb300-fp8/8k1k/mtp/ 2>/dev/null
```
Repository: ishandhanani/srt-slurm
Length of output: 489

🏁 Script executed:
```bash
grep -r "deepseek\|security\|revision" ./recipes/ 2>/dev/null | grep -i "doc\|readme\|comment\|note" | head -20
```
Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:
```bash
find ./recipes -name "README*" -o -name "*.md" | head -10
```
Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:
```bash
find . -maxdepth 2 -name "README*" -o -name "*.md" | head -10
```
Repository: ishandhanani/srt-slurm
Length of output: 239

🏁 Script executed:
```bash
grep -r "trust-remote-code\|DeepSeek" ./docs/ ./README.md 2>/dev/null | head -20
```
Repository: ishandhanani/srt-slurm
Length of output: 551

🏁 Script executed:
```bash
grep -B5 -A5 "trust-remote-code" ./recipes/gb300-fp8/8k1k/mtp/max.yaml | head -30
```
Repository: ishandhanani/srt-slurm
Length of output: 569

🏁 Script executed:
```bash
grep -r "skip-tokenizer-init\|trust-remote" . --include="*.md" --include="*.txt" 2>/dev/null | head -10
```
Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:
```bash
cat ./docs/installation.md | grep -A10 -B10 "DeepSeek\|trust"
```
Repository: ishandhanani/srt-slurm
Length of output: 1217

🏁 Script executed:
```bash
cat ./docs/config-reference.md | head -100
```
Repository: ishandhanani/srt-slurm
Length of output: 2655

🏁 Script executed:
```bash
find . -type f \( -name "*.yaml" -o -name "*.yml" \) -path "*/recipes/*" -exec grep -l "deepseek-ai" {} \; | xargs grep -B2 -A8 "trust-remote-code" | head -40
```
Repository: ishandhanani/srt-slurm
Length of output: 2949

🏁 Script executed:
```bash
grep -r "trust.remote\|remote.code" . --include="*.md" --include="*.txt" 2>/dev/null
```
Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:
```bash
find . -name "CHANGELOG*" -o -name "SECURITY*" -o -name "SECURITY.md" | head -5
```
Repository: ishandhanani/srt-slurm
Length of output: 48
Pin model revision or document the security rationale for trust-remote-code: true.
The deepseek-ai/DeepSeek-R1 model loaded with trust-remote-code: true at lines 67-69 and 125-127 allows arbitrary code execution from the model repository without a pinned revision. If this setting is required, either:
- Pin a specific vetted commit hash (e.g., deepseek-ai/DeepSeek-R1@<commit-hash>)
- Mirror the model internally and load from a controlled source
- Document the security assessment and why the risk is acceptable

This also applies to similar configurations in other recipe files.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/max.yaml` around lines 67 - 69, The recipe sets
served-model-name "deepseek-ai/DeepSeek-R1" with trust-remote-code: true without
a pinned revision; update the configuration to either pin the model reference
(e.g., append @<commit-hash> to the served-model-name), or change the source to
an internal mirror, or add a clear security rationale comment in the recipe
explaining why trust-remote-code is acceptable and what vetting was performed;
apply the same fix for any other blocks that set trust-remote-code alongside
served-model-name in this recipe.
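One way to express the pin is the `@<commit>` suffix the prompt itself suggests; a hedged config sketch (verify that your loader actually accepts this form before committing it, and substitute a real vetted SHA for the placeholder):

```yaml
# Illustrative only: pin the remote-code snapshot to a reviewed commit.
served-model-name: "deepseek-ai/DeepSeek-R1@<commit-sha>"  # placeholder revision
skip-tokenizer-init: true
# trust-remote-code remains an opt-in to run repo code; the pin limits it to the vetted snapshot.
trust-remote-code: true
```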
```yaml
# GB300 FP8 Mid Throughput Configuration

name: "gb300-8k1k-fp8-mid"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

frontend:
  nginx_container: nginx
```
🧩 Analysis chain
🌐 Web query:
SGLang v0.5.8 cu130 GB300 FP8 support release notes
💡 Result:
SGLang v0.5.8 (Jan 23, 2026) — upstream release notes
GitHub tag v0.5.8 highlights include DeepSeek V3.2 optimizations (incl. FP8 KV cache mention), FlashAttention 4 decoding kernels, SGLang-Diffusion updates, and dependency bumps (e.g., sgl-kernel → 0.3.21). [1]
CUDA 13.0 (“cu130”) + GB300 + FP8: release notes / support notes
- NVIDIA’s SGLang container release notes (CUDA 13.0.x-based) explicitly list CUDA 13.0 compatibility, GB300/B300 support, and FP8 precision support (Hopper and above). [2]
- SGLang installation docs note that for CUDA 13 / B300 / GB300, Docker is recommended, and alternatively you can install the matching `sgl_kernel` cu130 wheel from the sgl-project wheel releases. [3]
- NVIDIA forum post also points to the cu130 kernel wheel index for SGLang. [4]
Sources:
[1] (github.com)
[2] (docs.nvidia.com)
[3] (docs.sglang.io)
[4] (forums.developer.nvidia.com)
Citations:
- 1: https://github.com/sgl-project/sglang/releases
- 2: https://docs.nvidia.com/deeplearning/frameworks/sglang-release-notes/rel-25-11.html?utm_source=openai
- 3: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 4: https://forums.developer.nvidia.com/t/run-sglang-in-thor/348815/3?utm_source=openai
🌐 Web query:
SGLang GB300 support when added
💡 Result:
SGLang’s GB300/B300 support is called out as added in NVIDIA’s SGLang container Release 25.11 (“GB300/B300 support” is listed under Key Features and Enhancements). The NVIDIA release note page was last updated Jan 30, 2026. [1]
SGLang’s own install docs also include a troubleshooting note specifically for B300/GB300 (“ptxas fatal… sm_103a…”) indicating operational support in current documentation. [2]
Sources: [1] [2]
🌐 Web query:
lmsysorg sglang v0.5.8 cu130 runtime container
💡 Result:
I couldn’t find evidence that a `lmsysorg/sglang:v0.5.8-cu130-runtime` image tag exists on Docker Hub (at least publicly indexed). The CUDA 13.0 “runtime” images that are referenced/available are:
- Stable alias (recommended for CUDA 13 setups): `lmsysorg/sglang:latest-cu130-runtime` (docs.sglang.io)
- A known versioned CUDA 13 runtime tag: `lmsysorg/sglang:v0.5.7-cu130-runtime` (docker.aityp.com)

Commands:
```bash
docker pull lmsysorg/sglang:latest-cu130-runtime
# or
docker pull lmsysorg/sglang:v0.5.7-cu130-runtime
```
Citations:
- 1: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 2: https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime?utm_source=openai
The SGLang container tag v0.5.8-cu130-runtime does not appear to exist; use latest-cu130-runtime or verify this is a private/internal image.
GB300 and FP8 support are confirmed in NVIDIA SGLang Release 25.11 with CUDA 13.0, but the specified versioned tag v0.5.8-cu130-runtime was not found on public Docker Hub. Available alternatives are latest-cu130-runtime (stable, recommended) or v0.5.7-cu130-runtime. Confirm whether the v0.5.8 tag is available in your private registry or switch to an existing public tag to avoid deployment failure.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml` around lines 1 - 12, The container tag
specified in model.container ("lmsysorg/sglang:v0.5.8-cu130-runtime") appears
invalid; update the tag to a known-public image (e.g.,
"lmsysorg/sglang:latest-cu130-runtime" or
"lmsysorg/sglang:v0.5.7-cu130-runtime") or confirm and document that
"v0.5.8-cu130-runtime" exists in your private registry; change the
model.container value accordingly and ensure any deployment/configuration docs
reference the chosen tag.
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's locate and examine the YAML file
find . -type f -name "mid.yaml" | head -5
```
Repository: ishandhanani/srt-slurm
Length of output: 255

🏁 Script executed:
```bash
# Also search for the file path pattern
find . -type f -path "*gb300-fp8*" -name "*.yaml" | head -10
```
Repository: ishandhanani/srt-slurm
Length of output: 477
🌐 Web query:
Hugging Face trust_remote_code security risks and best practices
💡 Result:
What trust_remote_code=True does (and why it’s risky)
- It allows executing Python code pulled from a Hugging Face Hub repo (e.g., custom `modeling_*.py`, `tokenization_*.py`, `pipeline.py`) when loading with Transformers “Auto” classes (and similarly enables execution of dataset loading scripts in `datasets`). This can lead to arbitrary code execution (RCE) if the repo is malicious or compromised. [1]
- Even without remote code, loading pickled weight formats (common in PyTorch `.bin` / some checkpoints) can also enable code execution during deserialization. Hugging Face explicitly warns that pickle can be abused for arbitrary code execution. [2]

Threat model / concrete attack paths
- Malicious custom modeling/pipeline code runs at import/load time once you opt in with `trust_remote_code=True`. [1]
- Repo updates as a supply-chain risk: if you load by branch name (e.g., default `main`) and don’t pin a commit, a later update could introduce malicious code. Hugging Face recommends pinning a revision to protect against repo updates. [1]
- Pickle-based exploits in weight files can bypass expectations; HF provides scanning and warnings, but also states their approach is not foolproof. [2]
- Platform reality: the Hub scans files for malware on every commit, but that’s not a guarantee of safety. [3]

Best practices (practical checklist)
- Don’t use `trust_remote_code=True` unless you must. Prefer models that work with built-in Transformers architectures. [1]
- If you must use it, review the repo code first (the exact Python files that will be imported/executed). HF explicitly recommends verifying modeling files before enabling it. [1]
- Pin to an immutable revision (commit hash or tag) when using remote code, to avoid silently pulling new code later. [1]
  - Example (conceptual): load with `revision="<commit_sha>"` and keep that SHA under change control.
- Prefer `safetensors` and enforce it:
  - In Transformers, use `use_safetensors=True` to avoid unsafe weight formats (and fail if only unsafe weights exist). [1]
  - More generally, prefer safetensors over pickle-based weights. [2]
- Treat Hub scanning as defense-in-depth, not trust:
  - HF runs malware scanning on repository files at each commit. [3]
  - HF also performs pickle-related scanning/inspection and highlights suspicious imports, but it’s not a complete guarantee. [2]
  - Consider additional scanning (HF integrates third-party scanning like Protect AI). [4]
- Sandbox execution when evaluating untrusted/community repos: run in an isolated container/VM, least-privilege user, read-only mounts where possible, and restrict outbound network/credentials (because remote code runs as you).
- For datasets: avoid executing dataset loading scripts unless necessary; `datasets` introduced `trust_remote_code` for custom dataset code and provides an environment variable to disable trusting remote code by default (`HF_DATASETS_TRUST_REMOTE_CODE=0`). [5]

Key takeaway
trust_remote_code=True is an explicit opt-in to run third-party Python from a model/dataset repo, so handle it like any other supply-chain dependency: review, pin, sandbox, and avoid pickle where possible. [1][2]

Sources:
[1] Transformers security policy (remote code + pin revision + safetensors guidance) (github.com)
[2] Hugging Face Hub “Pickle Scanning” (pickle RCE risk + mitigations + not foolproof) (huggingface.co)
[3] Hugging Face Hub “Malware Scanning” (scans every file each commit) (huggingface.co)
[4] Hugging Face Hub “Third-party scanner: Protect AI” (huggingface.co)
[5] datasets release notes describing trust_remote_code and HF_DATASETS_TRUST_REMOTE_CODE=0 (newreleases.io)

Citations:
- 1: https://github.com/huggingface/transformers/security/policy?utm_source=openai
- 2: https://huggingface.co/docs/hub/security-pickle?utm_source=openai
- 3: https://huggingface.co/docs/hub/security-malware?utm_source=openai
- 4: https://huggingface.co/docs/hub/en/security-protectai?utm_source=openai
- 5: https://newreleases.io/project/github/huggingface/datasets/release/2.16.0?utm_source=openai
🏁 Script executed:
```bash
# Read the YAML file to see the context around lines 67-70 and 125-127
cat -n ./recipes/gb300-fp8/8k1k/mtp/mid.yaml
```
Repository: ishandhanani/srt-slurm
Length of output: 6651
Pin the model revision or mirror internally; trust-remote-code: true without a pinned commit hash enables arbitrary code execution.
The deepseek-ai/DeepSeek-R1 reference (lines 67, 125) loads from the latest repository state; any subsequent update to the model repo could execute malicious code. Specify a commit hash or tag (e.g., "deepseek-ai/DeepSeek-R1@<commit_sha>") to prevent silent code execution, mirror the model internally if supported by your loader, or document the rationale and any additional security controls (e.g., sandboxing, code review process).
This applies to both prefill (lines 67–70) and decode (lines 125–127) sections.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml` around lines 67 - 70, The recipe
references the remote repo "deepseek-ai/DeepSeek-R1" with trust-remote-code:
true which allows arbitrary remote code execution; update both occurrences of
the served model reference in the prefill and decode sections (the
"served-model-name" fields) to pin to a specific commit or tag (e.g.,
"deepseek-ai/DeepSeek-R1@<commit_sha>") or switch to an internally mirrored
model ID, and if you intentionally must keep an unpinned reference, add a short
documented rationale and note any compensating controls (sandboxing, code review
process) adjacent to the served-model-name and trust-remote-code settings so
reviewers can verify the risk mitigation.
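A pinned reference in the `model@revision` form the prompts describe can be checked mechanically; a sketch (the helper and the `@` convention are illustrative here, not an SGLang API):

```python
def split_pinned_ref(ref: str):
    """Split 'org/model@revision' into (model_id, revision); revision is None when unpinned."""
    model_id, sep, revision = ref.partition("@")
    return model_id, (revision if sep else None)

print(split_pinned_ref("deepseek-ai/DeepSeek-R1"))         # ('deepseek-ai/DeepSeek-R1', None)
print(split_pinned_ref("deepseek-ai/DeepSeek-R1@abc123"))  # ('deepseek-ai/DeepSeek-R1', 'abc123')
```

A recipe lint could then fail any config that combines trust-remote-code: true with a None revision.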
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml`:
- Around line 77-81: The recipe sets context-length to 10000 but
max-total-tokens to 8192, which risks rejecting/truncating full-length requests;
update either max-total-tokens to at least 10000 (or higher to allow headroom)
or reduce context-length to <=8192 so they align—modify the values for
context-length and/or max-total-tokens in this config (look for the entries
named "context-length" and "max-total-tokens") to ensure max-total-tokens >=
context-length.
In `@recipes/gb200-fp8/8k1k/low-latency-mtp.yaml`:
- Around line 1-12: The recipe name's node-count suffix ("name":
"gb200-fp8-8k1k-1p-1d-low-latency-mtp") does not match the declared resources
(decode_nodes: 2); either update the name to reflect 1p-2d (change "1p-1d" to
"1p-2d") or change decode_nodes to 1 so the "name" and the resources align; edit
the top-level name field or the decode_nodes field accordingly to resolve the
mismatch.
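The name-vs-resources mismatch amounts to keeping the `<N>p-<M>d` token in the recipe name in sync with the resources block; a sketch of that lint, assuming the naming convention seen in these recipes (the helper is ours):

```python
import re

def name_matches_nodes(name: str, prefill_nodes: int, decode_nodes: int) -> bool:
    """True when an embedded '<N>p-<M>d' token agrees with the declared node counts."""
    m = re.search(r"(\d+)p-(\d+)d", name)
    if m is None:
        return True  # name does not encode node counts; nothing to check
    return (int(m.group(1)), int(m.group(2))) == (prefill_nodes, decode_nodes)

print(name_matches_nodes("gb200-fp8-8k1k-1p-1d-low-latency-mtp", 1, 2))  # False (the flagged mismatch)
print(name_matches_nodes("gb200-fp8-8k1k-1p-2d-low-latency-mtp", 1, 2))  # True
```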
🧹 Nitpick comments (2)
recipes/gb200-fp8/8k1k/low-latency-mtp.yaml (1)
16-58: Consider YAML anchors to reduce env duplication. Prefill and decode environments are nearly identical. Anchors make it harder for them to drift.

♻️ Example anchor pattern

```diff
 backend:
-  prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+  common_environment: &common_env
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    # ...keep the remaining shared keys here...
+  prefill_environment:
+    <<: *common_env
     SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
     # ...
-  decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+  decode_environment:
+    <<: *common_env
     SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
     SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
     # ...
```

recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml (1)
3-3: Align recipe `name` with the filename/tier for discoverability. The file path says “mid-curve-5p1d”, but the name reads “mid-tpt”. Consider renaming for clarity.
```yaml
context-length: 10000
disaggregation-mode: "prefill"
mem-fraction-static: 0.95
max-total-tokens: 8192
chunked-prefill-size: 8192
```
max-total-tokens is lower than context-length.
This can cause full-length requests to be rejected or truncated. Please align these values (either raise max-total-tokens or reduce context-length).
🐛 Example fix (raise token cap)

```diff
-max-total-tokens: 8192
+max-total-tokens: 10000
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
context-length: 10000
disaggregation-mode: "prefill"
mem-fraction-static: 0.95
max-total-tokens: 10000
chunked-prefill-size: 8192
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml` around lines 77 - 81, The recipe
sets context-length to 10000 but max-total-tokens to 8192, which risks
rejecting/truncating full-length requests; update either max-total-tokens to at
least 10000 (or higher to allow headroom) or reduce context-length to <=8192 so
they align—modify the values for context-length and/or max-total-tokens in this
config (look for the entries named "context-length" and "max-total-tokens") to
ensure max-total-tokens >= context-length.
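The constraint this prompt states (max-total-tokens >= context-length) is easy to assert in a recipe lint; a sketch using the recipe key names (the checker itself is hypothetical, and the rule is the review's claim rather than a documented SGLang invariant):

```python
def check_token_budget(cfg: dict) -> list:
    """Flag configs whose max-total-tokens cannot hold a full context-length request."""
    problems = []
    ctx = cfg.get("context-length")
    cap = cfg.get("max-total-tokens")
    if ctx is not None and cap is not None and cap < ctx:
        problems.append(f"max-total-tokens ({cap}) < context-length ({ctx})")
    return problems

# The flagged 1k8k low-latency recipe: a 10000-token context vs an 8192-token cap.
print(check_token_budget({"context-length": 10000, "max-total-tokens": 8192}))
# ['max-total-tokens (8192) < context-length (10000)']
```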
```yaml
name: "gb200-fp8-8k1k-1p-1d-low-latency-mtp"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 2
  prefill_workers: 1
```
Align the recipe name with the node counts.
The name says 1p-1d but decode_nodes: 2. Either rename the recipe to 1p-2d or change decode_nodes to 1 to avoid mislabeling in UIs/logs.
💡 Example fix (rename to match resources)

```diff
-name: "gb200-fp8-8k1k-1p-1d-low-latency-mtp"
+name: "gb200-fp8-8k1k-1p-2d-low-latency-mtp"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
name: "gb200-fp8-8k1k-1p-2d-low-latency-mtp"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 2
  prefill_workers: 1
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/8k1k/low-latency-mtp.yaml` around lines 1 - 12, The recipe
name's node-count suffix ("name": "gb200-fp8-8k1k-1p-1d-low-latency-mtp") does
not match the declared resources (decode_nodes: 2); either update the name to
reflect 1p-2d (change "1p-1d" to "1p-2d") or change decode_nodes to 1 so the
"name" and the resources align; edit the top-level name field or the
decode_nodes field accordingly to resolve the mismatch.
Force-pushed from c6da362 to 8ca02d6 (Compare)
Actionable comments posted: 11
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml`:
- Around line 1-3: The top comment incorrectly reads "Max Throughput
Configuration"; update the header comment to match the actual recipe type by
changing it to indicate "Mid-Curve Configuration" (or similar) so it aligns with
the name "gb200-fp8-mid-curve-3p1d-mtp" and the filename; ensure the descriptive
comment at the top reflects mid-curve rather than max-throughput for clarity.
In `@recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml`:
- Line 71: The YAML sets trust-remote-code: true for the DeepSeek-R1 model which
allows arbitrary remote code execution; either add a concise security
justification comment next to trust-remote-code explaining why remote code is
required for DeepSeek-R1 (e.g., necessary custom tokenizer/architecture) or pin
the model reference to a specific commit by changing the model identifier to
include a commit hash (e.g., deepseek-ai/DeepSeek-R1@<commit-hash>), or disable
the flag if not needed; update occurrences that match the same pattern (the
trust-remote-code key and the DeepSeek-R1 model reference) so all instances are
consistently justified, pinned, or disabled.
In `@recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml`:
- Around line 68-71: The tp-size, dp-size, and ep-size settings (tp-size,
dp-size, ep-size) currently multiply to 512/32,768 which exceeds the available
GPUs and will prevent launch; update the parallelism triplets at both
occurrences (the block around tp-size: 8 / dp-size: 8 / ep-size: 8 and the
similar block later) so that tp-size * dp-size * ep-size equals the actual
available GPU count for the corresponding stage (e.g., set the three values to a
combination whose product equals 40 for the prefill stage and equals 32 for the
decode stage, or otherwise scale them down to the node/GPU capacity you have).
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml`:
- Around line 13-19: The resources block currently requests 2 nodes × 4 GPUs
(gpus_per_node) but the configured parallelism (tp-size, dp-size, ep-size = 8
each for both prefill and decode) yields 512 ranks, massively oversubscribing
hardware; update either the compute allocations (increase
prefill_nodes/decode_nodes or gpus_per_node) or reduce parallelism dimensions
(tp-size, dp-size, ep-size) so total ranks <= available GPUs (prefill_nodes *
gpus_per_node * prefill_workers and decode_nodes * gpus_per_node *
decode_workers); locate and fix the resources keys (prefill_nodes,
prefill_workers, decode_nodes, decode_workers, gpus_per_node) and the
prefill/decode tp-size, dp-size, ep-size settings to make them consistent.
- Around line 67-69: The model configs enable trust-remote-code without pinning
a revision; update the sglang_config for both the prefill model block
(containing served-model-name/skip-tokenizer-init/trust-remote-code) and the
decode model block to include a revision field that passes the desired pinned
commit/tag/branch via the SGLang --revision parameter (e.g., add "revision":
"<commit-or-tag>") so remote code execution is tied to a specific model version.
In `@recipes/gb300-fp8/1k1k/mtp/mid.yaml`:
- Around line 66-68: The recipe currently sets served-model-name
("deepseek-ai/DeepSeek-R1") with trust-remote-code: true without a pinned
revision; update the served-model-name to include a fixed revision (for example
deepseek-ai/DeepSeek-R1@<commit-hash>) or point it to a locally mirrored copy to
prevent automatic unverified code updates, and apply the same change to the
decode section where the model is referenced (the block containing
skip-tokenizer-init, trust-remote-code and the decode section at lines ~124-126)
so both runtime and decoding use the pinned revision.
- Around line 12-18: The recipe oversubscribes GPUs: current resources
(prefill_nodes, prefill_workers, decode_nodes, decode_workers, gpus_per_node)
produce far fewer physical GPUs than the total training ranks implied by the
parallelism (tp, dp, ep). Fix by either lowering the parallelism dimensions
(reduce tp, dp, and/or ep values used for prefill and decode) so total_ranks ≤
gpus_per_node * nodes * gpus_per_node_per_node, or increase
prefill_nodes/decode_nodes (and/or gpus_per_node) so physical GPU count matches
total_ranks; update the prefill_* and decode_* blocks consistently (also adjust
same pattern at the other occurrences noted around lines 71-74 and 130-133) to
ensure ranks-per-GPU stays within acceptable limits.
In `@recipes/gb300-fp8/1k8k/mtp/low-latency.yaml`:
- Around line 76-83: The max-total-tokens value is too low for the 1k8k
workload: with isl=1024 and osl=8192 the total token budget is 9216 and
context-length is 9300, so update the max-total-tokens setting from 8192 to 9216
in the low-latency recipes; modify the max-total-tokens entry in the affected
YAML so it matches the 9216 total (ensure the change is applied in both the MTP
and STP low-latency files where max-total-tokens appears).
In `@recipes/gb300-fp8/1k8k/mtp/mid.yaml`:
- Around line 13-19: The parallelism settings in the resources block (gpu_type,
prefill_nodes, prefill_workers, decode_nodes, decode_workers, gpus_per_node)
imply extreme oversubscription (e.g., 512 and 110,592 ranks) so update the
configuration to match the actual GPU availability: either correct
prefill_nodes/prefill_workers and decode_nodes/decode_workers to produce the
intended total GPU counts, or adjust the tensor/model/data parallel dimensions
(tp/dp/ep) elsewhere in the recipe to equal total_gpus = nodes * gpus_per_node;
ensure tp * dp * ep does not exceed total_gpus for prefill and decode phases and
apply the same correction to the other occurrences referenced (lines ~75-78 and
~134-137).
- Around line 70-73: Add a pinned revision when using trust-remote-code: update
the model spec that contains served-model-name: "deepseek-ai/DeepSeek-R1" (and
the same block at lines ~128-131) to include a revision field (commit SHA or
tag) or point to an internal mirror instead of leaving it unpinned; ensure the
YAML entry next to served-model-name/skip-tokenizer-init/trust-remote-code
includes the equivalent of SGLang’s --revision value to lock the remote code, or
add a short comment explaining why unpinned remote code is acceptable if you
choose not to pin.
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml`:
- Around line 13-19: The configured tensor/model/data parallelism (tp/dp/ep) is
massively larger than the actual GPUs allocated in resources; compute total GPUs
as nodes * gpus_per_node * workers for prefill (prefill_nodes, prefill_workers,
gpus_per_node) and decode (decode_nodes, decode_workers, gpus_per_node) and
either reduce the tp/dp/ep values so their product (tp * dp * ep) does not
exceed total GPUs or proportionally increase nodes/workers to match the desired
parallelism; update the tp/dp/ep settings referenced around the existing
resource blocks (prefill_* and decode_*) and the other occurrences called out
(lines ~71-74 and ~131-134) so each parallelism product ≤ allocated GPUs.
🧹 Nitpick comments (1)
recipes/gb200-fp8/1k8k/low-latency-mtp.yaml (1)
25-65: JIT DeepGEMM flags are internally inconsistent.
`SGLANG_ENABLE_JIT_DEEPGEMM` is `false` but `SGLANG_JIT_DEEPGEMM_FAST_WARMUP` is `1` in both env blocks. If JIT is intentionally off, drop the warmup flag; if warmup is desired, enable JIT.

♻️ Example fix (remove warmup when JIT is off)

```diff
 prefill_environment:
   ...
-  - SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
   ...
 decode_environment:
   ...
-  - SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
```
```yaml
# GB200 FP8 Max Throughput Configuration

name: "gb200-fp8-mid-curve-3p1d-mtp"
```
Inconsistent header comment.
The comment on Line 1 says "Max Throughput Configuration" but the name and filename indicate this is a "mid-curve" configuration. This appears to be a copy-paste artifact.
Proposed fix

```diff
-# GB200 FP8 Max Throughput Configuration
+# GB200 FP8 Mid-Curve 3P1D MTP Configuration
```

📝 Committable suggestion
```yaml
# GB200 FP8 Mid-Curve 3P1D MTP Configuration

name: "gb200-fp8-mid-curve-3p1d-mtp"
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml` around lines 1 - 3, The top
comment incorrectly reads "Max Throughput Configuration"; update the header
comment to match the actual recipe type by changing it to indicate "Mid-Curve
Configuration" (or similar) so it aligns with the name
"gb200-fp8-mid-curve-3p1d-mtp" and the filename; ensure the descriptive comment
at the top reflects mid-curve rather than max-throughput for clarity.
```yaml
# Model configuration
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Scripts executed (repository: ishandhanani/srt-slurm):

```shell
cat -n recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml | head -n 150 | tail -n 80
cat -n recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml | head -n 75 | tail -n 30
cat recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml | head -n 1 && wc -l recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml
cat -n recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml
rg -i "revision|commit|ref" recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml
fd "README\|readme\|SECURITY\|security" . --type f | head -20
rg -i "deepseek" . --type yaml --type md --type txt -B2 -A2
git log --oneline --all -- recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml | head -5
```
Add security justification or pin DeepSeek-R1 model to a specific revision.
trust-remote-code: true enables arbitrary code execution from the model repository. This pattern appears throughout the codebase without explicit security justification or model revision pinning. Either document why this setting is required for DeepSeek-R1, pin to a specific commit hash (e.g., deepseek-ai/DeepSeek-R1@<commit-hash>), or disable if not essential.
Also applies to: line 129
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml` at line 71, The YAML sets
trust-remote-code: true for the DeepSeek-R1 model which allows arbitrary remote
code execution; either add a concise security justification comment next to
trust-remote-code explaining why remote code is required for DeepSeek-R1 (e.g.,
necessary custom tokenizer/architecture) or pin the model reference to a
specific commit by changing the model identifier to include a commit hash (e.g.,
deepseek-ai/DeepSeek-R1@<commit-hash>), or disable the flag if not needed;
update occurrences that match the same pattern (the trust-remote-code key and
the DeepSeek-R1 model reference) so all instances are consistently justified,
pinned, or disabled.
```yaml
# Parallelism
tp-size: 8
dp-size: 8
ep-size: 8
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
find . -name "mid-curve-5p1d-mtp.yaml" -type f
cat -n ./recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml
find ./recipes -name "*.yaml" -type f | head -20
head -150 ./recipes/gb200-fp8/8k1k/max-tpt.yaml | tail -100
ls -la ./recipes/gb200-fp8/8k1k/
cat -n ./recipes/gb200-fp8/8k1k/mid-curve-5p1d.yaml | sed -n '60,135p'
cat -n ./recipes/gb200-fp8/8k1k/low-latency.yaml | sed -n '60,135p'
find ./recipes -name "README*" -o -name "*.md" | head -10
ls -la ./ | grep -i readme
cat ./README.md
find . -name "*.py" -type f | xargs grep -l "tp-size\|tp_size\|tensor-parallel" | head -10
cat ./analysis/srtlog/config_reader.py
```
🌐 Web query:
SGLang tp-size dp-size ep-size world size semantics parallelism
💡 Result:
In SGLang, these flags describe how many GPU ranks you allocate to each parallelism dimension, and they must multiply to the total number of ranks (GPUs) you launch.

- `tp-size` (Tensor Parallel / TP): splits a single model replica's large weight matrices across `tp-size` GPUs (model-sharding within a replica). [4]
- `dp-size` (Data Parallel / DP): runs `dp-size` independent replicas of the model (each replica gets its own share of requests). [3]
- `ep-size` (Expert Parallel / EP): for MoE models, shards/places experts across `ep-size` GPUs to reduce expert-weight memory and enable MoE scaling. [1][2]
- world size: total number of distributed ranks participating in the job; in multi-node setups it's typically `nnodes × gpus_per_node`. [3]

How they relate (the key semantic rule): SGLang validates that the total GPUs/ranks used by the serving engine equal the product of the parallelism sizes:

total_gpus = tp_size × pp_size × ep_size × dp_size

(where pp_size is pipeline parallelism, if used). [3]

Important EP constraint (common gotcha): some EP backends (e.g., DeepEP/Mooncake) only support `ep_size = tp_size`; if you want hybrid EP+TP with `ep_size < tp_size`, SGLang may restrict you to other (less optimized) dispatch backends. [1]
🏁 Script executed:

```shell
grep -n "prefill_nodes\|prefill_workers\|decode_nodes\|decode_workers\|gpus_per_node" ./recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml
```
Critical GPU allocation mismatch: parallelism settings incompatible with available GPUs.
Prefill requires 512 GPUs (8×8×8) but only 40 are available (10 nodes × 4 GPUs); decode requires 32,768 GPUs (32×32×32) but only 32 are available (8 nodes × 4 GPUs). SGLang will fail to launch. Align tp-size, dp-size, and ep-size so their product equals available GPU count per stage. Same issue at lines 126-129.
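The rank-count rule quoted above lends itself to a quick sanity check. The snippet below is illustrative only (it is not srtctl code); it assumes the `total_gpus = tp × pp × ep × dp` rule from the SGLang docs and plugs in the decode-stage numbers from this recipe.

```python
# Illustrative sanity check: recipe parallelism vs. allocated GPUs.
# Assumes SGLang's rule: total ranks = tp * dp * ep * pp per worker.

def ranks_required(tp: int, dp: int, ep: int, pp: int = 1) -> int:
    """Ranks implied by the parallelism dimensions."""
    return tp * dp * ep * pp

def gpus_available(nodes: int, gpus_per_node: int) -> int:
    """Physical GPUs allocated to a stage."""
    return nodes * gpus_per_node

# Decode stage of mid-curve-5p1d-mtp: tp=dp=ep=32 on 8 nodes x 4 GPUs.
needed = ranks_required(32, 32, 32)   # 32768
have = gpus_available(8, 4)           # 32
assert needed > have                  # oversubscribed: launch would fail
```

Running the same check on the prefill stage (tp=dp=ep=8 on 10 nodes × 4 GPUs) flags 512 ranks against 40 GPUs.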
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml` around lines 68 - 71, The
tp-size, dp-size, and ep-size settings (tp-size, dp-size, ep-size) currently
multiply to 512/32,768 which exceeds the available GPUs and will prevent launch;
update the parallelism triplets at both occurrences (the block around tp-size: 8
/ dp-size: 8 / ep-size: 8 and the similar block later) so that tp-size * dp-size
* ep-size equals the actual available GPU count for the corresponding stage
(e.g., set the three values to a combination whose product equals 40 for the
prefill stage and equals 32 for the decode stage, or otherwise scale them down
to the node/GPU capacity you have).
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
cat -n recipes/gb300-fp8/1k1k/mtp/max.yaml | head -80
sed -n '120,135p' recipes/gb300-fp8/1k1k/mtp/max.yaml
find recipes -name "*.yaml" -type f | head -10 | xargs grep -l "served-model-name"
grep -A2 -B2 "trust-remote-code" recipes/gb200-fp4/1k1k/max-tpt.yaml | head -20
grep -r "revision\|commit" recipes/gb300-fp8/1k1k/mtp/ || echo "No revision/commit fields found"
find . -name "README*" -o -name "*.md" | xargs grep -l "model\|revision" | head -5
grep -A10 -B5 "served-model-name" docs/config-reference.md
grep -r "HuggingFace\|model.*revision\|huggingface" docs/ --include="*.md" | head -20
rg "served-model-name|trust-remote-code" --type python | head -20
git ls-files | grep -E "\.(py|sh)$" | head -20
cat -n analysis/srtlog/config_reader.py | head -100
```
🌐 Web query:
SGLang served-model-name revision commit pinning trust-remote-code
💡 Result:
In SGLang (CLI: `python -m sglang.launch_server`):

- `--served-model-name`: overrides the model name returned by the OpenAI-compatible `GET /v1/models` endpoint. [1]
- `--revision`: selects a specific Hugging Face Hub model version; it can be a branch name, tag, or commit id (i.e., commit pinning). [1]
- `--trust-remote-code`: enables loading models that require custom code from the Hub (equivalent idea to Transformers' "trust remote code"). Default is `False`; pass the flag to enable. [1]
Example (pin to a specific commit):

```shell
python -m sglang.launch_server \
  --model-path org/model-repo \
  --revision <commit_sha_or_tag_or_branch> \
  --served-model-name my-model \
  --trust-remote-code
```

Hugging Face's Hub tooling also documents that revision can be a branch/tag/commit hash when downloading snapshots/files, which is the same concept SGLang exposes via `--revision`. [2]
Sources: [1] SGLang server arguments docs, [2] Hugging Face Hub download docs.
🏁 Scripts executed:

```shell
grep -r "revision:" recipes/ --include="*.yaml" | head -5
grep -i "revision" docs/config-reference.md
```
Add revision pinning to model configuration when using trust-remote-code: true.
Both the prefill (lines 67–69) and decode (lines 125–127) sections enable trust-remote-code without pinning the model revision. Unpinned remote code can change and execute arbitrary code. Use SGLang's --revision parameter to pin a specific commit, tag, or branch—add a revision field to the sglang_config to specify the exact model version.
Affected locations
- Lines 67–69: prefill model config
- Lines 125–127: decode model config
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml` around lines 67 - 69, The model configs
enable trust-remote-code without pinning a revision; update the sglang_config
for both the prefill model block (containing
served-model-name/skip-tokenizer-init/trust-remote-code) and the decode model
block to include a revision field that passes the desired pinned
commit/tag/branch via the SGLang --revision parameter (e.g., add "revision":
"<commit-or-tag>") so remote code execution is tied to a specific model version.
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
# Locate and examine the YAML file
fd "mid.yaml" recipes/
# Read the file to see the context around lines 66-68 and 124-126
cat -n recipes/gb300-fp8/1k1k/mtp/mid.yaml | sed -n '60,130p'
# Search for served-model-name patterns to see if pinning is used elsewhere
rg "served-model-name" --type yaml -A 1 -B 1
# Look for any documentation or examples about model pinning
fd -e md -e txt -e rst | xargs rg -l "revision|pin" 2>/dev/null | head -10
# Check if there are any other similar configurations with different patterns
rg "trust-remote-code" --type yaml -B 2 -A 2
```
Pin DeepSeek-R1 revision when trust-remote-code is enabled.
Unpinned remote code allows automatic fetching of model updates without integrity verification, introducing supply-chain RCE risk. Specify a fixed revision (e.g., deepseek-ai/DeepSeek-R1@<commit-hash>) or use a locally mirrored copy.
Also applies to: lines 124-126 (decode section)
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/mid.yaml` around lines 66 - 68, The recipe
currently sets served-model-name ("deepseek-ai/DeepSeek-R1") with
trust-remote-code: true without a pinned revision; update the served-model-name
to include a fixed revision (for example deepseek-ai/DeepSeek-R1@<commit-hash>)
or point it to a locally mirrored copy to prevent automatic unverified code
updates, and apply the same change to the decode section where the model is
referenced (the block containing skip-tokenizer-init, trust-remote-code and the
decode section at lines ~124-126) so both runtime and decoding use the pinned
revision.
```yaml
watchdog-timeout: 1000000
context-length: 9300
disaggregation-mode: "prefill"
disaggregation-transfer-backend: nixl
mem-fraction-static: 0.95
max-total-tokens: 8192
chunked-prefill-size: 8192
max-prefill-tokens: 8192
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
find . -name "low-latency.yaml" -path "*gb300-fp8*" | head -20
wc -l ./recipes/gb300-fp8/1k8k/mtp/low-latency.yaml
cat -n ./recipes/gb300-fp8/1k8k/mtp/low-latency.yaml | sed -n '70,90p'
cat -n ./recipes/gb300-fp8/1k8k/mtp/low-latency.yaml | sed -n '125,136p'
cat -n ./recipes/gb300-fp8/1k1k/mtp/low-latency.yaml | sed -n '70,90p'
cat -n ./recipes/gb300-fp8/8k1k/mtp/low-latency.yaml | sed -n '70,90p'
grep -n "max-total-tokens" ./recipes/gb300-fp8/1k8k/mtp/low-latency.yaml
cat -n ./recipes/gb300-fp8/1k8k/stp/low-latency.yaml | sed -n '70,90p'
grep -A 5 "^benchmark:" ./recipes/gb300-fp8/1k8k/stp/low-latency.yaml
```
Align max-total-tokens with the 1k8k workload in both MTP and STP variants.
The benchmark specifies isl: 1024 and osl: 8192 (total: 9216 tokens), but max-total-tokens is set to 8192. Since context-length is 9300, raise max-total-tokens to match.
🛠️ Proposed adjustment

```diff
-          max-total-tokens: 8192
+          max-total-tokens: 9300
```

Applies to both:
- `recipes/gb300-fp8/1k8k/mtp/low-latency.yaml` (line 81)
- `recipes/gb300-fp8/1k8k/stp/low-latency.yaml` (line 77)
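The token-budget arithmetic behind this comment is simple enough to check directly; a minimal sketch follows, with values taken from the benchmark stanza quoted in the review and illustrative variable names:

```python
# Token budget check for the 1k8k low-latency recipes (illustrative).
isl, osl = 1024, 8192        # benchmark input/output sequence lengths
context_length = 9300        # from the recipe
max_total_tokens = 8192      # from the recipe

budget = isl + osl           # 9216 tokens per request
assert budget == 9216
assert context_length >= budget       # context window covers the workload
assert max_total_tokens < budget      # 8192 < 9216: the flagged mismatch
```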
📝 Committable suggestion
```yaml
watchdog-timeout: 1000000
context-length: 9300
disaggregation-mode: "prefill"
disaggregation-transfer-backend: nixl
mem-fraction-static: 0.95
max-total-tokens: 9300
chunked-prefill-size: 8192
max-prefill-tokens: 8192
```
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k8k/mtp/low-latency.yaml` around lines 76 - 83, The
max-total-tokens value is too low for the 1k8k workload: with isl=1024 and
osl=8192 the total token budget is 9216 and context-length is 9300, so update
the max-total-tokens setting from 8192 to 9216 in the low-latency recipes;
modify the max-total-tokens entry in the affected YAML so it matches the 9216
total (ensure the change is applied in both the MTP and STP low-latency files
where max-total-tokens appears).
```yaml
resources:
  gpu_type: "gb300"
  prefill_nodes: 2
  prefill_workers: 1
  decode_nodes: 12
  decode_workers: 1
  gpus_per_node: 4
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
find . -name "mid.yaml" -o -name "*gb300*" -type f | head -20
git ls-files | grep -E "(recipes|gb300)" | head -30
cat -n ./recipes/gb300-fp8/1k8k/mtp/mid.yaml
```
Parallelism configuration requires extreme GPU oversubscription.
Prefill requires 512 ranks (8×8×8) on 8 GPUs; decode requires 110,592 ranks (48×48×48) on 48 GPUs. This produces 64:1 and 2,304:1 ratios respectively. Clarify whether these parallelism dimensions are correctly specified for this hardware setup, or adjust tp/dp/ep to align with available GPU count.
Also applies to: lines 75-78, 134-137
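The oversubscription ratios quoted above follow directly from the recipe numbers; a minimal arithmetic sketch:

```python
# Ratios cited in the comment (illustrative arithmetic only).
prefill_ranks = 8 * 8 * 8        # tp * dp * ep = 512
prefill_gpus = 2 * 4             # prefill_nodes * gpus_per_node = 8
decode_ranks = 48 * 48 * 48      # 110592
decode_gpus = 12 * 4             # 48

assert prefill_ranks // prefill_gpus == 64       # 64:1
assert decode_ranks // decode_gpus == 2304       # 2304:1
```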
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k8k/mtp/mid.yaml` around lines 13 - 19, The parallelism
settings in the resources block (gpu_type, prefill_nodes, prefill_workers,
decode_nodes, decode_workers, gpus_per_node) imply extreme oversubscription
(e.g., 512 and 110,592 ranks) so update the configuration to match the actual
GPU availability: either correct prefill_nodes/prefill_workers and
decode_nodes/decode_workers to produce the intended total GPU counts, or adjust
the tensor/model/data parallel dimensions (tp/dp/ep) elsewhere in the recipe to
equal total_gpus = nodes * gpus_per_node; ensure tp * dp * ep does not exceed
total_gpus for prefill and decode phases and apply the same correction to the
other occurrences referenced (lines ~75-78 and ~134-137).
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
fd -e yaml recipes/gb300-fp8/1k8k/mtp/mid.yaml
cat -n recipes/gb300-fp8/1k8k/mtp/mid.yaml | sed -n '60,80p'
cat -n recipes/gb300-fp8/1k8k/mtp/mid.yaml | sed -n '120,140p'
rg "model.*revision\|revision.*:" recipes/gb300-fp8 -B 2 -A 2
grep -r "served-model-name" recipes/ | head -20
```
🌐 Web query:
sglang served-model-name revision parameter trust-remote-code
💡 Result:
In SGLang (`python -m sglang.launch_server ...`), these CLI flags mean:

- `--trust-remote-code`: allows loading models that ship custom Python modeling code from the model hub (disabled by default; pass the flag to enable). [1]
- `--revision <rev>`: pins the model to a specific Hub version: branch name, tag, or commit id (defaults to the repo's default revision if omitted). [1]
- `--served-model-name <name>`: overrides the model name returned by the OpenAI-compatible `v1/models` endpoint (useful if you want clients to see a stable/alias name instead of the repo ID). [1]
Example:

```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --revision main \
  --trust-remote-code \
  --served-model-name my-qwen-prod \
  --host 0.0.0.0 --port 30000
```

Sources: [1] [2]
Pin the model revision when using trust-remote-code.
Unpinned remote code can change and execute arbitrarily. SGLang supports --revision parameter to pin a commit/tag. Add revision specification, use an internal mirror, or document why this is acceptable.
Also applies to lines 128-131.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k8k/mtp/mid.yaml` around lines 70 - 73, Add a pinned
revision when using trust-remote-code: update the model spec that contains
served-model-name: "deepseek-ai/DeepSeek-R1" (and the same block at lines
~128-131) to include a revision field (commit SHA or tag) or point to an
internal mirror instead of leaving it unpinned; ensure the YAML entry next to
served-model-name/skip-tokenizer-init/trust-remote-code includes the equivalent
of SGLang’s --revision value to lock the remote code, or add a short comment
explaining why unpinned remote code is acceptable if you choose not to pin.
```yaml
resources:
  gpu_type: "gb300"
  prefill_nodes: 10
  prefill_workers: 5
  decode_nodes: 8
  decode_workers: 1
  gpus_per_node: 4
```
🧩 Analysis chain
🏁 Script executed:
# First, find and read the configuration file
fd "mid.yaml" recipes/Repository: ishandhanani/srt-slurm
Length of output: 281
🏁 Script executed:
# Read the complete configuration file to understand all parameters
cat recipes/gb300-fp8/8k1k/mtp/mid.yamlRepository: ishandhanani/srt-slurm
Length of output: 5342
🏁 Script executed:
# Search for tp/dp/ep parallelism configuration in related files
rg "tp.*dp.*ep|tensor_parallel|data_parallel|expert_parallel" recipes/gb300-fp8/8k1k/mtp/ -A 2 -B 2Repository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Check the other sections mentioned (lines 71-74, 131-134)
sed -n '71,74p; 131,134p' recipes/gb300-fp8/8k1k/mtp/mid.yamlRepository: ishandhanani/srt-slurm
Length of output: 226
🏁 Script executed:
# Compare with other similar config files to understand the pattern
cat recipes/gb300-fp8/8k1k/stp/mid.yaml | grep -A 10 "resources:"Repository: ishandhanani/srt-slurm
Length of output: 250
🏁 Script executed:
# Check if there are any configuration documents or READMEs explaining resource allocation
find recipes/ -name "README*" -o -name "*.md" | head -5Repository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Search for how prefill_workers and decode_workers are used in the codebase
rg "prefill_workers|decode_workers" --type yaml --type py -B 2 -A 2Repository: ishandhanani/srt-slurm
Length of output: 50379
🏁 Script executed:
# Get line count to ensure safe extraction
wc -l recipes/gb300-fp8/8k1k/mtp/mid.yamlRepository: ishandhanani/srt-slurm
Length of output: 105
🏁 Script executed:
# Check lines 71-74 and 131-134
sed -n '65,80p; 125,140p' recipes/gb300-fp8/8k1k/mtp/mid.yamlRepository: ishandhanani/srt-slurm
Length of output: 828
🏁 Script executed:
# Look at the schema.py to understand GPU calculation logic
sed -n '1,100p' src/srtctl/core/schema.py | grep -A 20 "gpus_per_prefill\|gpus_per_decode"Repository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Get more context around the parallelism sections
sed -n '65,85p; 125,145p' recipes/gb300-fp8/8k1k/mtp/mid.yaml
Length of output: 1047
🏁 Script executed:
# Check if there are any comments or validation logic about parallelism
grep -A 5 -B 5 "gpus_per_prefill\|gpus_per_decode" src/srtctl/core/schema.py | head -40
Length of output: 1333
🏁 Script executed:
# Check similar config files to see if this pattern is common
grep -l "prefill_workers: 5" recipes/gb300-fp8/*/*.yaml
grep -l "tp-size: 32" recipes/gb300-fp8/*/*.yaml
Length of output: 185
🏁 Script executed:
# Find configs with prefill_workers: 5
find recipes/gb300-fp8 -name "*.yaml" -exec grep -l "prefill_workers: 5" {} \;
Length of output: 137
🏁 Script executed:
# Find configs with tp-size: 32
find recipes/gb300-fp8 -name "*.yaml" -exec grep -l "tp-size: 32" {} \;
Length of output: 281
🏁 Script executed:
# Let's examine what the actual total GPUs are for each stage
python3 << 'EOF'
# Prefill: 10 nodes * 5 workers * 4 GPUs/node = ?
prefill_total_gpus = 10 * 4 # Total for prefill stage (not multiplied by workers)
prefill_workers = 5
gpus_per_prefill_worker = (10 * 4) // 5
# Decode: 8 nodes * 1 worker * 4 GPUs/node = ?
decode_total_gpus = 8 * 4 # Total for decode stage (not multiplied by workers)
decode_workers = 1
gpus_per_decode_worker = (8 * 4) // 1
# Parallelism
prefill_tp = 8
prefill_dp = 8
prefill_ep = 8
prefill_world_size = prefill_tp * prefill_dp * prefill_ep
decode_tp = 32
decode_dp = 32
decode_ep = 32
decode_world_size = decode_tp * decode_dp * decode_ep
print(f"Prefill: {prefill_total_gpus} total GPUs, {prefill_workers} workers, {gpus_per_prefill_worker} GPUs/worker")
print(f"Prefill parallelism: tp={prefill_tp}, dp={prefill_dp}, ep={prefill_ep}, world_size={prefill_world_size}")
print(f"Prefill oversubscription: {prefill_world_size} ranks / {gpus_per_prefill_worker} GPUs per worker = {prefill_world_size / gpus_per_prefill_worker:.1f}x")
print()
print(f"Decode: {decode_total_gpus} total GPUs, {decode_workers} workers, {gpus_per_decode_worker} GPUs/worker")
print(f"Decode parallelism: tp={decode_tp}, dp={decode_dp}, ep={decode_ep}, world_size={decode_world_size}")
print(f"Decode oversubscription: {decode_world_size} ranks / {gpus_per_decode_worker} GPUs per worker = {decode_world_size / gpus_per_decode_worker:.1f}x")
EOF
Length of output: 408
🏁 Script executed:
# Check the STP version of the same config to compare
cat recipes/gb300-fp8/8k1k/stp/mid.yaml | grep -A 20 "resources:"
Length of output: 664
🏁 Script executed:
# Look for any comments or docs about disaggregation and parallelism
grep -r "disaggregation" recipes/gb300-fp8/8k1k/mtp/mid.yaml -i
Length of output: 640
🏁 Script executed:
# Check if there are validation warnings or known issues in test files
grep -r "oversubscription\|world_size\|512\|32768" tests/ --include="*.py" -B 2 -A 2 | head -40
Length of output: 585
Align tp/dp/ep parallelism with allocated GPU resources.
Prefill allocates 40 GPUs (10 nodes × 4 gpus/node, 5 workers = 8 GPUs/worker) for 512 parallelism ranks (tp=8, dp=8, ep=8), creating 64x oversubscription. Decode allocates 32 GPUs (8 nodes × 4 gpus/node, 1 worker) for 32,768 ranks (tp=32, dp=32, ep=32), creating 1024x oversubscription. Reduce tp/dp/ep values or increase GPU allocation to match the parallelism requirements.
Also applies to: 71-74, 131-134
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml` around lines 13 - 19, The configured
tensor/model/data parallelism (tp/dp/ep) is massively larger than the actual
GPUs allocated in resources; compute total GPUs as nodes * gpus_per_node *
workers for prefill (prefill_nodes, prefill_workers, gpus_per_node) and decode
(decode_nodes, decode_workers, gpus_per_node) and either reduce the tp/dp/ep
values so their product (tp * dp * ep) does not exceed total GPUs or
proportionally increase nodes/workers to match the desired parallelism; update
the tp/dp/ep settings referenced around the existing resource blocks (prefill_*
and decode_*) and the other occurrences called out (lines ~71-74 and ~131-134)
so each parallelism product ≤ allocated GPUs.
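The resource arithmetic the review walks through can be sketched as a small helper. This is illustrative only — the field names mirror the recipe keys, but the function is not part of srtctl:

```python
# Hypothetical sanity check for the resource/parallelism mismatch described
# above. Inputs mirror the recipe's resources block; the helper itself is a
# sketch, not srtctl code.

def check_stage(nodes: int, gpus_per_node: int, workers: int,
                tp: int, dp: int, ep: int) -> tuple[int, int]:
    """Return (gpus_per_worker, world_size) for one stage."""
    gpus_per_worker = (nodes * gpus_per_node) // workers
    world_size = tp * dp * ep  # ranks implied by the parallelism settings
    return gpus_per_worker, world_size

# Numbers quoted in the review: prefill = 10 nodes x 4 GPUs, 5 workers,
# tp=dp=ep=8 -> 8 GPUs/worker vs 512 ranks (64x oversubscription).
gpus, ranks = check_stage(10, 4, 5, 8, 8, 8)
print(gpus, ranks, ranks // gpus)
```

With the decode numbers (8 nodes, 4 GPUs, 1 worker, tp=dp=ep=32) the same helper yields 32 GPUs against 32,768 ranks, matching the 1024x figure in the comment.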
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml`:
- Line 5: The Docker image tag for SGLang is invalid: replace the container
value "lmsysorg/sglang:v0.5.8-cu130" with the CUDA13 runtime-suffixed tag (e.g.,
"lmsysorg/sglang:v0.5.8-cu130-runtime"); if v0.5.8-cu130-runtime is not
published, use the confirmed available tag
"lmsysorg/sglang:v0.5.7-cu130-runtime" instead and verify against the official
SGLang image tags; update the container field accordingly.
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml`:
- Around line 153-156: The decode section is missing the max-total-tokens
parameter; add a line under the decode block setting max-total-tokens: 524288
(to match the prefill section's max-total-tokens) so the decode block explicitly
defines the same total-token limit (compare with the prefill max-total-tokens
and the other recipe max-tpt-2p1d-mtp.yaml if you intend a different value).
In `@recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml`:
- Around line 45-64: The decode_environment block is missing the
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM flag alongside the existing
SGLANG_DECODE_BOOTSTRAP_TIMEOUT; add SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" to the
decode_environment entries (or add a comment explaining why it's intentionally
omitted for the 1k8k profile) so the behavior matches the other profile that
includes both SGLANG_DECODE_BOOTSTRAP_TIMEOUT and
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM.
  model:
    path: "dsfp8"
-   container: "0.5.8"
+   container: "lmsysorg/sglang:v0.5.8-cu130"
🧩 Analysis chain
🌐 Web query:
SGLang v0.5.8 Docker image tags lmsysorg CUDA 13
💡 Result:
I couldn’t find any published lmsysorg/sglang Docker tag for v0.5.8 that targets CUDA 13 (e.g., no evidence of a v0.5.8-cu13* / v0.5.8-cu130* tag in the public listings I could access).
What is documented/visible for CUDA 13 with lmsysorg/sglang:
- Nightly / dev (CUDA 13): lmsysorg/sglang:dev-cu13 (docs.sglang.io)
- Stable (CUDA 13.0 runtime line): lmsysorg/sglang:latest-cu130-runtime (docs.sglang.io)
- Versioned CUDA 13.0 runtime tag I can confirm exists (latest I saw): lmsysorg/sglang:v0.5.7-cu130-runtime (docker.aityp.com)
If you want, tell me whether you need build image vs -runtime, and I can try to enumerate any additional *-cu13 / *-cu130-* tags that are currently published.
Citations:
- 1: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 2: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 3: https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime?utm_source=openai
🌐 Web query:
Docker Hub lmsysorg/sglang v0.5.8 available tags cu130 runtime
💡 Result:
I couldn’t find any evidence that a v0.5.8-*-cu130-runtime tag exists for lmsysorg/sglang on Docker Hub (no search hits for v0.5.8-cu130-runtime, and the Docker Hub repo page doesn’t surface any v0.5.8 tags). [1]
What is referenced/available for CUDA 13.0 runtime:
- Stable (recommended) CUDA 13 runtime tag: lmsysorg/sglang:latest-cu130-runtime (called out in the official install docs for “CUDA 13 environment”). [2]
- Latest versioned CUDA 13 runtime tag I could find: lmsysorg/sglang:v0.5.7-cu130-runtime (listed by a Docker Hub mirror page that tracks that exact tag). [3]
Sources:
[1] Docker Hub repo page for lmsysorg/sglang (hub.docker.com)
[2] SGLang install docs mentioning latest-cu130-runtime (docs.sglang.io)
[3] Mirror page showing v0.5.7-cu130-runtime exists (docker.aityp.com)
Citations:
- 1: https://hub.docker.com/r/lmsysorg/sglang/?utm_source=openai
- 2: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 3: https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime?utm_source=openai
🌐 Web query:
SGLang CUDA 13 Docker image runtime variant "-runtime" suffix
💡 Result:
In SGLang’s official Docker Hub images (lmsysorg/sglang), the -runtime suffix means the production “runtime” variant of an image tag (e.g., latest-runtime) that is smaller (~40%) because it excludes build tools and development dependencies. [1]
For CUDA 13, SGLang recommends using either: [1]
- Nightly/dev CUDA 13 image: lmsysorg/sglang:dev-cu13
- Stable CUDA 13 runtime image: lmsysorg/sglang:latest-cu130-runtime (note cu130 + -runtime)
Related naming in NVIDIA’s Dynamo artifacts also uses a “runtime” image plus a CUDA 13 tag (e.g., sglang-runtime:<version>-cuda13). [2]
Update the SGLang Docker image tag for CUDA 13 with the -runtime suffix.
The tag lmsysorg/sglang:v0.5.8-cu130 does not exist in published repositories. For CUDA 13, SGLang images require the -runtime suffix (e.g., lmsysorg/sglang:v0.5.8-cu130-runtime). The latest confirmed versioned CUDA 13 runtime tag is lmsysorg/sglang:v0.5.7-cu130-runtime. Verify whether v0.5.8 with CUDA 13 runtime has been released; if not, use the v0.5.7 tag or check the official SGLang documentation for the recommended CUDA 13 image.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml` at line 5, The Docker image tag
for SGLang is invalid: replace the container value
"lmsysorg/sglang:v0.5.8-cu130" with the CUDA13 runtime-suffixed tag (e.g.,
"lmsysorg/sglang:v0.5.8-cu130-runtime"); if v0.5.8-cu130-runtime is not
published, use the confirmed available tag
"lmsysorg/sglang:v0.5.7-cu130-runtime" instead and verify against the official
SGLang image tags; update the container field accordingly.
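The tag rewrite the fix prompt describes is mechanical enough to sketch. The `-runtime` suffix rule is an assumption taken from the tags discussed above, not an official naming convention, and the helper is hypothetical:

```python
# Illustrative helper: rewrite a CUDA-13 SGLang tag to the -runtime variant
# the review recommends. The suffix rule is an assumption based on the tags
# quoted above, not a documented Docker Hub naming scheme.

def runtime_tag(image: str) -> str:
    repo, _, tag = image.rpartition(":")
    if tag.endswith("-cu130"):  # bare cu130 tags get the runtime suffix
        tag += "-runtime"
    return f"{repo}:{tag}"

print(runtime_tag("lmsysorg/sglang:v0.5.8-cu130"))
# lmsysorg/sglang:v0.5.8-cu130-runtime
```

Tags that already carry `-runtime` (e.g. the confirmed `v0.5.7-cu130-runtime`) pass through unchanged.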
  # Memory and token limits
  mem-fraction-static: 0.75
  chunked-prefill-size: 36864
Missing max-total-tokens in decode section.
The prefill section specifies max-total-tokens: 524288 (line 96), but the decode section lacks this parameter. The comparable max-tpt-2p1d-mtp.yaml has max-total-tokens: 1703116 for decode. If this omission is intentional (relying on a default), consider adding a comment; otherwise, add the appropriate value.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml` around lines 153 - 156, The
decode section is missing the max-total-tokens parameter; add a line under the
decode block setting max-total-tokens: 524288 (to match the prefill section's
max-total-tokens) so the decode block explicitly defines the same total-token
limit (compare with the prefill max-total-tokens and the other recipe
max-tpt-2p1d-mtp.yaml if you intend a different value).
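The parity check behind this comment can be expressed over an already-parsed recipe. Keys mirror the YAML above; the function is a sketch, not srtctl validation code:

```python
# Sketch of the prefill/decode parity check the comment asks for, operating
# on a recipe that has already been parsed into nested dicts.

def missing_in_decode(recipe: dict, key: str = "max-total-tokens"):
    """Return prefill's value when decode omits the same key, else None."""
    prefill = recipe.get("prefill", {})
    decode = recipe.get("decode", {})
    if key in prefill and key not in decode:
        return prefill[key]  # candidate value for decode to adopt
    return None

recipe = {"prefill": {"max-total-tokens": 524288},
          "decode": {"mem-fraction-static": 0.75}}
print(missing_in_decode(recipe))  # 524288
```

A CI lint built on this would flag exactly the omission described here before the recipe reaches a cluster.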
# Decode-specific environment variables
decode_environment:
  TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
  DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
  SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
  MC_TE_METRIC: "true"
  SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
  SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
  SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
  SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
  SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
  MC_FORCE_MNNVL: "1"
  NCCL_MNNVL_ENABLE: "1"
  NCCL_CUMEM_ENABLE: "1"
  SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
  SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
  PYTHONUNBUFFERED: "1"
  SGLANG_ENABLE_SPEC_V2: "1"
  SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH: "1"
  SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
Missing SGLANG_HACK_SEQ_BOOTSTRAP_ROOM in decode environment.
Comparing to max-tpt-2p1d-mtp.yaml, the decode environment there includes SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" (alongside SGLANG_DECODE_BOOTSTRAP_TIMEOUT). This file has the timeout but lacks the hack flag. If this is intentional for the 1k8k profile, consider adding a comment; otherwise, add it for consistency.
Proposed fix if needed
SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+ SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
  SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml` around lines 45 - 64, The
decode_environment block is missing the SGLANG_HACK_SEQ_BOOTSTRAP_ROOM flag
alongside the existing SGLANG_DECODE_BOOTSTRAP_TIMEOUT; add
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" to the decode_environment entries (or add a
comment explaining why it's intentionally omitted for the 1k8k profile) so the
behavior matches the other profile that includes both
SGLANG_DECODE_BOOTSTRAP_TIMEOUT and SGLANG_HACK_SEQ_BOOTSTRAP_ROOM.
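Flag drift like this between profiles is easy to surface with a set difference over the parsed env blocks. The dicts below are stand-ins for the YAML, and the helper is illustrative:

```python
# A minimal way to surface flags (e.g. SGLANG_HACK_SEQ_BOOTSTRAP_ROOM) that
# one profile's decode_environment sets and another omits. The env dicts are
# stand-ins for the parsed YAML blocks.

def env_diff(reference: dict, candidate: dict) -> set:
    """Keys present in reference but absent from candidate."""
    return set(reference) - set(candidate)

ref = {"SGLANG_DECODE_BOOTSTRAP_TIMEOUT": "1000",
       "SGLANG_HACK_SEQ_BOOTSTRAP_ROOM": "1"}
cand = {"SGLANG_DECODE_BOOTSTRAP_TIMEOUT": "1000"}
print(env_diff(ref, cand))  # {'SGLANG_HACK_SEQ_BOOTSTRAP_ROOM'}
```

Running this across all decode profiles would catch intentional omissions only if they are whitelisted, which doubles as the documentation the review asks for.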
6230ad0 to d6f43da (Compare)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
recipes/gb200-fp8/1k1k/low-latency-mtp.yaml (1)
24-24: ⚠️ Potential issue | 🟡 Minor
Contradictory JIT DeepGEMM configuration.
SGLANG_ENABLE_JIT_DEEPGEMM is set to "false" (line 24), but SGLANG_JIT_DEEPGEMM_FAST_WARMUP is set to "1" (line 38). Enabling fast warmup for a disabled feature is contradictory and may indicate a configuration oversight.
Please clarify the intent: if JIT DeepGEMM should be used, set SGLANG_ENABLE_JIT_DEEPGEMM: "true"; otherwise, consider removing or disabling the fast warmup flag.
Also applies to: 37-38
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml`:
- Around line 24-25: SGLANG_ENABLE_JIT_DEEPGEMM is currently "false" while
SGLANG_JIT_DEEPGEMM_FAST_WARMUP is "1", which is a no-op when JIT DeepGEMM is
disabled; fix by either setting SGLANG_ENABLE_JIT_DEEPGEMM to "true" to enable
JIT DeepGEMM (so SGLANG_JIT_DEEPGEMM_FAST_WARMUP takes effect) or remove/clear
SGLANG_JIT_DEEPGEMM_FAST_WARMUP so there is no misleading configuration; update
the values for SGLANG_ENABLE_JIT_DEEPGEMM and/or SGLANG_JIT_DEEPGEMM_FAST_WARMUP
to make them consistent.
- Around line 44-45: The decode environment has a configuration mismatch:
SGLANG_ENABLE_JIT_DEEPGEMM is set to "false" while
SGLANG_JIT_DEEPGEMM_FAST_WARMUP is "1"; update the decode env to be consistent
by either setting SGLANG_ENABLE_JIT_DEEPGEMM to "true" to enable JIT DeepGEMM
when keeping SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1", or disable/remove
SGLANG_JIT_DEEPGEMM_FAST_WARMUP if you intend to keep
SGLANG_ENABLE_JIT_DEEPGEMM: "false"; adjust the entries for
SGLANG_ENABLE_JIT_DEEPGEMM and SGLANG_JIT_DEEPGEMM_FAST_WARMUP so they reflect
the same intent.
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml`:
- Line 185: The concurrencies entry is using YAML list syntax (concurrencies:
[4096,7168,7680]) which is inconsistent with other recipes that use a delimited
string; update the concurrencies field to the same string format used elsewhere
(e.g., the delimited form used across recipes) or confirm the benchmark runner
accepts list syntax, then either convert the value for the concurrencies key to
the canonical delimited string or add validation to accept both formats in the
runner; locate and change the concurrencies entry in this recipe (symbol:
concurrencies) to match the project convention.
🧹 Nitpick comments (2)
recipes/gb300-fp8/1k1k/mtp/max.yaml (1)
157-159: Consider adding max-total-tokens for decode.
The prefill section specifies max-total-tokens: 524288, but the decode section only has mem-fraction-static: 0.75 without an explicit token limit. For consistency with other max-throughput recipes (e.g., gb200-fp8/1k1k/max-tpt-2p1d-mtp.yaml line 163), consider adding max-total-tokens to the decode config.
recipes/gb300-fp8/1k8k/mtp/max.yaml (1)
70-73: Confirm trust-remote-code: true is required and pin the model revision.
Line 72 and Line 130 enable remote code execution from the model repo. If this is required, please ensure the repo is vetted and pinned to a fixed revision or internal snapshot to avoid supply-chain drift.
Also applies to: 128-130
  SGLANG_ENABLE_JIT_DEEPGEMM: "false"
  SGLANG_ENABLE_FLASHINFER_GEMM: "1"
Inconsistent SGLANG_ENABLE_JIT_DEEPGEMM setting.
SGLANG_ENABLE_JIT_DEEPGEMM is set to "false" (line 24), but SGLANG_JIT_DEEPGEMM_FAST_WARMUP is set to "1" (line 38). The fast warmup flag has no effect when JIT DeepGEMM is disabled. Either enable JIT DeepGEMM or remove the fast warmup setting.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml` around lines 24 - 25,
SGLANG_ENABLE_JIT_DEEPGEMM is currently "false" while
SGLANG_JIT_DEEPGEMM_FAST_WARMUP is "1", which is a no-op when JIT DeepGEMM is
disabled; fix by either setting SGLANG_ENABLE_JIT_DEEPGEMM to "true" to enable
JIT DeepGEMM (so SGLANG_JIT_DEEPGEMM_FAST_WARMUP takes effect) or remove/clear
SGLANG_JIT_DEEPGEMM_FAST_WARMUP so there is no misleading configuration; update
the values for SGLANG_ENABLE_JIT_DEEPGEMM and/or SGLANG_JIT_DEEPGEMM_FAST_WARMUP
to make them consistent.
  SGLANG_ENABLE_JIT_DEEPGEMM: "false"
  SGLANG_ENABLE_FLASHINFER_GEMM: "1"
Same JIT DeepGEMM inconsistency in decode environment.
The decode environment has the same mismatch: SGLANG_ENABLE_JIT_DEEPGEMM: "false" (line 44) with SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" (line 59).
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml` around lines 44 - 45, The decode
environment has a configuration mismatch: SGLANG_ENABLE_JIT_DEEPGEMM is set to
"false" while SGLANG_JIT_DEEPGEMM_FAST_WARMUP is "1"; update the decode env to
be consistent by either setting SGLANG_ENABLE_JIT_DEEPGEMM to "true" to enable
JIT DeepGEMM when keeping SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1", or
disable/remove SGLANG_JIT_DEEPGEMM_FAST_WARMUP if you intend to keep
SGLANG_ENABLE_JIT_DEEPGEMM: "false"; adjust the entries for
SGLANG_ENABLE_JIT_DEEPGEMM and SGLANG_JIT_DEEPGEMM_FAST_WARMUP so they reflect
the same intent.
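The contradiction flagged in both environments reduces to one boolean check. The truthiness rules below are an assumption for illustration, not SGLang's actual env parsing:

```python
# Hedged lint for the JIT DeepGEMM contradiction: warn when the fast-warmup
# flag is set while JIT DeepGEMM itself is disabled. How SGLang actually
# parses these strings is an assumption here.

def jit_deepgemm_conflict(env: dict) -> bool:
    enabled = env.get("SGLANG_ENABLE_JIT_DEEPGEMM", "").lower() in ("1", "true")
    warmup = env.get("SGLANG_JIT_DEEPGEMM_FAST_WARMUP", "") == "1"
    return warmup and not enabled  # fast warmup for a disabled feature

env = {"SGLANG_ENABLE_JIT_DEEPGEMM": "false",
       "SGLANG_JIT_DEEPGEMM_FAST_WARMUP": "1"}
print(jit_deepgemm_conflict(env))  # True
```

Applying the same check to both the prefill and decode environment blocks would have flagged lines 24/38 and 44/59 in one pass.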
  type: "sa-bench"
  isl: 1024
  osl: 1024
  concurrencies: [4096,7168,7680]
Inconsistent concurrencies format.
This file uses a YAML list [4096,7168,7680] while other recipes use a delimited string format like "4x8x32x64x80x96x112x128". Verify the benchmark runner accepts both formats, or align with the convention used elsewhere.
Example fix (if string format is required)
- concurrencies: [4096,7168,7680]
+ concurrencies: "4096x7168x7680"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- concurrencies: [4096,7168,7680]
+ concurrencies: "4096x7168x7680"
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml` at line 185, The concurrencies entry is
using YAML list syntax (concurrencies: [4096,7168,7680]) which is inconsistent
with other recipes that use a delimited string; update the concurrencies field
to the same string format used elsewhere (e.g., the delimited form used across
recipes) or confirm the benchmark runner accepts list syntax, then either
convert the value for the concurrencies key to the canonical delimited string or
add validation to accept both formats in the runner; locate and change the
concurrencies entry in this recipe (symbol: concurrencies) to match the project
convention.
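The "accept both formats" option the prompt mentions can be sketched as a permissive parser. This is not actual benchmark-runner code, just an illustration of the two shapes seen in the recipes:

```python
# Sketch of a permissive concurrencies parser accepting both formats seen in
# the recipes: a YAML list ([4096, 7168, 7680]) or an "x"-delimited string
# ("4x8x32..."). Not srt-slurm code.

def parse_concurrencies(value) -> list[int]:
    if isinstance(value, str):
        return [int(part) for part in value.split("x")]
    return [int(part) for part in value]

print(parse_concurrencies("4096x7168x7680"))    # [4096, 7168, 7680]
print(parse_concurrencies([4096, 7168, 7680]))  # [4096, 7168, 7680]
```

Normalizing at load time keeps individual recipes free to use either style.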
* Add GB300 MTP recipes
* gb200
* rest of configs
* fixes
* gb200: use fast warmup, no dg cache
* fix oom on 8k1k prefill. update low latency containers
* fixes
* Working 8k1k LL MTP. and mid
* add nginx
* update container for gb300 too
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml`:
- Around line 157-159: The decode section is missing the max-total-tokens
setting; add an explicit max-total-tokens: 524288 entry under the decode section
(matching the prefill section) to maintain consistency with other recipes and
avoid implicit limits—update the decode block near mem-fraction-static and
chunked-prefill-size so it includes max-total-tokens: 524288.
🧹 Nitpick comments (5)
recipes/gb300-fp8/1k1k/mtp/mid.yaml (3)
9-10: Consider pinning the nginx container version.
Using nginx without a version tag may lead to inconsistent deployments as the latest tag changes over time.
💡 Suggested fix
  frontend:
-   nginx_container: nginx
+   nginx_container: nginx:1.27
25-25: Hardcoded date in cache directory path may become stale.
SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" contains what appears to be a date (October 21, 2025). This could cause confusion as the configuration ages or if the cache directory structure changes.
Also applies to: 44-44
116-120: Comment says "MTP" but configuration uses EAGLE algorithm.
The comment on line 116 indicates "MTP" (Multi-Token Prediction), but speculative-algorithm is set to "EAGLE". These are different speculative decoding approaches. If this is intentional (using EAGLE for MTP-style speculation), consider clarifying the comment to avoid confusion.
💡 Suggested clarification
- # MTP
+ # Speculative decoding (EAGLE)
  speculative-algorithm: "EAGLE"
Or if MTP is the intended approach:
  # MTP
- speculative-algorithm: "EAGLE"
+ speculative-algorithm: "MTP"
recipes/gb300-fp8/1k1k/mtp/max.yaml (1)
26-26: Hardcoded date-based cache directory path.
SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" uses a date-stamped path that may become stale or cause confusion. Consider parameterizing this or documenting the versioning strategy.
Also applies to: 45-45
recipes/gb300-fp8/8k1k/mtp/mid.yaml (1)
10-11: Pin the nginx image to a version or digest for reproducibility.
Unpinned tags can change behavior across deploys.
✅ Suggested adjustment (use your approved version/digest)
- nginx_container: nginx
+ nginx_container: "nginx:<pinned-version-or-digest>"
  # Memory and token limits
  mem-fraction-static: 0.75
  chunked-prefill-size: 36864
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check all max-total-tokens occurrences in gb300-fp8 configs
rg -n "max-total-tokens" recipes/gb300-fp8/ -A2 -B2 | head -100
Length of output: 6003
🏁 Script executed:
#!/bin/bash
# Also check the actual file to see the full context
echo "=== Full decode section context ==="
sed -n '123,179p' recipes/gb300-fp8/1k1k/mtp/max.yaml
Length of output: 1821
Decode section missing max-total-tokens setting.
The prefill section specifies max-total-tokens: 524288 but the decode section omits this setting. All other configuration files in the repository include this in their prefill sections. Add the explicit configuration for consistency, or document if intentionally omitted.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml` around lines 157 - 159, The decode
section is missing the max-total-tokens setting; add an explicit
max-total-tokens: 524288 entry under the decode section (matching the prefill
section) to maintain consistency with other recipes and avoid implicit
limits—update the decode block near mem-fraction-static and chunked-prefill-size
so it includes max-total-tokens: 524288.
Summary by CodeRabbit
New Features
Chores