Add all GB200/GB300 FP8 MTP recipes (#134)
Conversation
📝 Walkthrough
Adds many new GB300 and GB200 FP8 deployment recipe YAMLs and updates one GB200 YAML: configuration-only changes for model/runtime images, frontend, resources, per-stage prefill/decode envs, extensive sglang_config tuning (DeepEP/MTP/disaggregation), CUDA-graph lists, and benchmark stanzas.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml`:
- Around line 1-19: Add a missing slurm section to this recipe so it matches
other recipes (e.g., 1k1k/mtp/mid.yaml): inside the root document add a
top-level "slurm:" mapping containing the same keys used in other recipes (for
example partition, gres/node, cpus_per_task, and any node-specific settings) to
mirror the structure used alongside "model", "frontend", and "resources"; ensure
the section is named "slurm" and placed at root level so tools reading "name",
"model", "frontend", and "resources" will find it consistently.
In `@recipes/gb300-fp8/1k1k/mtp/mid.yaml`:
- Around line 1-19: The recipe is missing a slurm block which other recipes
include; add a slurm section after the existing resources block (same file's
top-level keys: name, model, frontend, resources) and include at minimum
time_limit: "04:00:00" (you can also add other standard keys like partition or
gres if required by your deployment); ensure the new slurm key is top-level and
follows the same YAML indentation/style as the existing blocks so the parser
recognizes it.
In `@recipes/gb300-fp8/8k1k/mtp/max.yaml`:
- Around line 13-19: The configured tensor/data/expert parallelism (tp, dp,
ep) massively oversubscribes GPUs: compute total_gpus = prefill_nodes *
gpus_per_node for prefill and decode_nodes * gpus_per_node for decode, then
ensure tp * dp * ep <= total_gpus (or increase nodes/gpus accordingly). Update
the tp/dp/ep values in the prefill and decode sections (and mirror the same
changes in the duplicate blocks referenced around lines 71-75 and 131-134) so
the product of tp×dp×ep does not exceed available GPUs, or alternatively
increase prefill_nodes/decode_nodes or gpus_per_node to match the desired
parallelism. Ensure consistency between prefill_workers/decode_workers and the
adjusted parallelism so world_size calculations in SGLang remain correct.
- Around line 67-69: The recipe sets served-model-name "deepseek-ai/DeepSeek-R1"
with trust-remote-code: true without a pinned revision; update the configuration
to either pin the model reference (e.g., append @<commit-hash> to the
served-model-name), or change the source to an internal mirror, or add a clear
security rationale comment in the recipe explaining why trust-remote-code is
acceptable and what vetting was performed; apply the same fix for any other
blocks that set trust-remote-code alongside served-model-name in this recipe.
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml`:
- Around line 67-70: The recipe references the remote repo
"deepseek-ai/DeepSeek-R1" with trust-remote-code: true which allows arbitrary
remote code execution; update both occurrences of the served model reference in
the prefill and decode sections (the "served-model-name" fields) to pin to a
specific commit or tag (e.g., "deepseek-ai/DeepSeek-R1@<commit_sha>") or switch
to an internally mirrored model ID, and if you intentionally must keep an
unpinned reference, add a short documented rationale and note any compensating
controls (sandboxing, code review process) adjacent to the served-model-name and
trust-remote-code settings so reviewers can verify the risk mitigation.
- Around line 1-12: The container tag specified in model.container
("lmsysorg/sglang:v0.5.8-cu130-runtime") appears invalid; update the tag to a
known-public image (e.g., "lmsysorg/sglang:latest-cu130-runtime" or
"lmsysorg/sglang:v0.5.7-cu130-runtime") or confirm and document that
"v0.5.8-cu130-runtime" exists in your private registry; change the
model.container value accordingly and ensure any deployment/configuration docs
reference the chosen tag.
🧹 Nitpick comments (5)
recipes/gb300-fp8/1k8k/mtp/max.yaml (1)
175-176: `cuda-graph-bs` list contains batch sizes exceeding `cuda-graph-max-bs`. The `cuda-graph-bs` list includes values up to 1024, but `cuda-graph-max-bs` is set to 512. Batch sizes beyond 512 in the list (544, 576, 608, 640, 672, 704, 736, 768, 1024) will not be used for CUDA graph capture. Consider removing these unused values to improve clarity, or adjust `cuda-graph-max-bs` if larger batch sizes are intended.

recipes/gb300-fp8/1k8k/mtp/mid.yaml (1)
175-176: `cuda-graph-bs` list contains unused batch sizes. Same issue as max.yaml: the list includes batch sizes up to 1024 while `cuda-graph-max-bs` is 512.

recipes/gb300-fp8/1k1k/mtp/max.yaml (1)
172-173: `cuda-graph-bs` list contains unused batch sizes. Same issue as other max/mid configs: the list includes batch sizes up to 1024 while `cuda-graph-max-bs` is 512. Consider trimming the list or adjusting max-bs.

recipes/gb300-fp8/8k1k/mtp/max.yaml (1)
23-63: Confirm the DG cache directory is mounted for both stages. Both prefill and decode set `SGLANG_DG_CACHE_DIR: /configs/dg-10212025`; please ensure this path is present and mounted into the container on all prefill/decode nodes to avoid startup failures or cache misses.

recipes/gb300-fp8/8k1k/mtp/mid.yaml (1)
23-63: Confirm the DG cache directory is mounted for both stages. Both prefill and decode set `SGLANG_DG_CACHE_DIR: /configs/dg-10212025`; please ensure this path is present and mounted into the container on all prefill/decode nodes to avoid startup failures or cache misses.
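The trim these nitpicks suggest is mechanical; a minimal sketch (the helper name is ours, not part of the repo tooling):

```python
def trim_cuda_graph_bs(bs_list, max_bs):
    """Keep only batch sizes that CUDA graph capture can actually use (bs <= max_bs)."""
    return [bs for bs in bs_list if bs <= max_bs]

# The flagged recipes list sizes up to 1024 while cuda-graph-max-bs is 512;
# everything above 512 is dead config.
print(trim_cuda_graph_bs([256, 512, 544, 576, 608, 640, 672, 704, 736, 768, 1024], 512))
# [256, 512]
```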
```yaml
# GB300 FP8 Max Throughput Configuration

name: "gb300-1k1k-fp8-max"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

frontend:
  nginx_container: nginx

resources:
  gpu_type: "gb300"
  prefill_nodes: 2
  prefill_workers: 1
  decode_nodes: 2
  decode_workers: 1
  gpus_per_node: 4
```
Missing slurm section.
Similar to 1k1k/mtp/mid.yaml, this file is missing the slurm section. Please add it for consistency with other recipe files.
Suggested fix

```diff
 resources:
   gpu_type: "gb300"
   prefill_nodes: 2
   prefill_workers: 1
   decode_nodes: 2
   decode_workers: 1
   gpus_per_node: 4
+slurm:
+  time_limit: "04:00:00"
+
 backend:
```

🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml` around lines 1 - 19, Add a missing slurm
section to this recipe so it matches other recipes (e.g., 1k1k/mtp/mid.yaml):
inside the root document add a top-level "slurm:" mapping containing the same
keys used in other recipes (for example partition, gres/node, cpus_per_task, and
any node-specific settings) to mirror the structure used alongside "model",
"frontend", and "resources"; ensure the section is named "slurm" and placed at
root level so tools reading "name", "model", "frontend", and "resources" will
find it consistently.
```yaml
# GB300 FP8 Mid Throughput Configuration
name: "gb300-1k1k-fp8-mid"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

frontend:
  nginx_container: nginx

resources:
  gpu_type: "gb300"
  prefill_nodes: 4
  prefill_workers: 2
  decode_nodes: 8
  decode_workers: 1
  gpus_per_node: 4
```
Missing slurm section.
This configuration file is missing the slurm section that other recipe files include (e.g., time_limit: "04:00:00"). This may cause issues or use unexpected defaults when the recipe is deployed.
Suggested fix
Add the slurm section after the resources block:
```diff
 resources:
   gpu_type: "gb300"
   prefill_nodes: 4
   prefill_workers: 2
   decode_nodes: 8
   decode_workers: 1
   gpus_per_node: 4
+slurm:
+  time_limit: "04:00:00"
+
 backend:
```

🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/mid.yaml` around lines 1 - 19, The recipe is
missing a slurm block which other recipes include; add a slurm section after the
existing resources block (same file's top-level keys: name, model, frontend,
resources) and include at minimum time_limit: "04:00:00" (you can also add other
standard keys like partition or gres if required by your deployment); ensure the
new slurm key is top-level and follows the same YAML indentation/style as the
existing blocks so the parser recognizes it.
```yaml
resources:
  gpu_type: "gb300"
  prefill_nodes: 12
  prefill_workers: 6
  decode_nodes: 6
  decode_workers: 1
  gpus_per_node: 4
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# First, find and read the target file
if [ -f "recipes/gb300-fp8/8k1k/mtp/max.yaml" ]; then
  echo "=== File exists, reading content ==="
  cat -n recipes/gb300-fp8/8k1k/mtp/max.yaml
else
  echo "=== File not found, searching for similar patterns ==="
  find . -name "*.yaml" -path "*/gb300*" -type f 2>/dev/null | head -20
fi
```

Repository: ishandhanani/srt-slurm
Length of output: 6697
🏁 Script executed:
```bash
#!/bin/bash
# Check if recipes/gb300-fp8 exists and list yaml files
if [ -d "recipes/gb300-fp8" ]; then
  echo "=== Directory structure ==="
  find recipes/gb300-fp8 -name "*.yaml" -type f | head -20
else
  echo "=== Searching for gb300 recipes ==="
  find . -type d -name "*gb300*" 2>/dev/null
fi
```

Repository: ishandhanani/srt-slurm
Length of output: 789
🏁 Script executed:
```bash
#!/bin/bash
# Verify tp/dp/ep sizing vs available GPUs across gb300 recipes (assumes world_size = tp*dp*ep).
python3 <<'PY'
from pathlib import Path
import yaml

def world(cfg):
    tp = int(cfg.get("tp-size", 1))
    dp = int(cfg.get("dp-size", 1))
    ep = int(cfg.get("ep-size", 1))
    return tp * dp * ep, tp, dp, ep

for path in sorted(Path("recipes/gb300-fp8").rglob("*.yaml")):
    try:
        data = yaml.safe_load(path.read_text())
        res = data.get("resources", {})
        gpn = res.get("gpus_per_node")
        pn = res.get("prefill_nodes")
        dn = res.get("decode_nodes")
        prefill_gpus = gpn * pn if gpn and pn else None
        decode_gpus = gpn * dn if gpn and dn else None
        sg = data.get("backend", {}).get("sglang_config", {})
        pre = sg.get("prefill", {})
        dec = sg.get("decode", {})
        if pre and prefill_gpus:
            w, tp, dp, ep = world(pre)
            if w > prefill_gpus:
                print(f"{path.name}: prefill world_size {w} > prefill_gpus {prefill_gpus} (tp={tp},dp={dp},ep={ep})")
        if dec and decode_gpus:
            w, tp, dp, ep = world(dec)
            if w > decode_gpus:
                print(f"{path.name}: decode world_size {w} > decode_gpus {decode_gpus} (tp={tp},dp={dp},ep={ep})")
    except Exception:
        pass
PY
```

Repository: ishandhanani/srt-slurm
Length of output: 1717
Parallelism factors exceed available GPUs by orders of magnitude.
Prefill: 48 GPUs (12 nodes × 4) configured with tp/dp/ep = 8/8/8 = 512 ranks (10.7× oversubscription). Decode: 24 GPUs (6 nodes × 4) configured with tp/dp/ep = 24/24/24 = 13,824 ranks (576× oversubscription). If SGLang uses world_size = tp × dp × ep, this configuration will fail to launch. Reduce tp/dp/ep values or scale GPU resources accordingly.
Also applies to lines 71-75 (prefill parallelism) and 131-134 (decode parallelism).
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/max.yaml` around lines 13 - 19, The configured
tensor/data/expert parallelism (tp, dp, ep) massively oversubscribes GPUs:
compute total_gpus = prefill_nodes * gpus_per_node for prefill and decode_nodes
* gpus_per_node for decode, then ensure tp * dp * ep <= total_gpus (or increase
nodes/gpus accordingly). Update the tp/dp/ep values in the prefill and decode
sections (and mirror the same changes in the duplicate blocks referenced around
lines 71-75 and 131-134) so the product of tp×dp×ep does not exceed available
GPUs, or alternatively increase prefill_nodes/decode_nodes or gpus_per_node to
match the desired parallelism. Ensure consistency between
prefill_workers/decode_workers and the adjusted parallelism so world_size
calculations in SGLang remain correct.
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Script executed:
```bash
find . -name "max.yaml" -path "*/gb300-fp8/8k1k/mtp/*" 2>/dev/null | head -20
```
Repository: ishandhanani/srt-slurm
Length of output: 103

🏁 Script executed:
```bash
wc -l ./recipes/gb300-fp8/8k1k/mtp/max.yaml
```
Repository: ishandhanani/srt-slurm
Length of output: 107

🏁 Script executed:
```bash
cat -n ./recipes/gb300-fp8/8k1k/mtp/max.yaml | sed -n '60,135p'
```
Repository: ishandhanani/srt-slurm
Length of output: 2539

🏁 Script executed:
```bash
grep -r "revision\|DeepSeek" ./recipes/gb300-fp8/8k1k/mtp/ 2>/dev/null
```
Repository: ishandhanani/srt-slurm
Length of output: 615

🏁 Script executed:
```bash
grep -r "trust-remote-code" ./recipes/gb300-fp8/8k1k/mtp/ 2>/dev/null
```
Repository: ishandhanani/srt-slurm
Length of output: 489

🏁 Script executed:
```bash
grep -r "deepseek\|security\|revision" ./recipes/ 2>/dev/null | grep -i "doc\|readme\|comment\|note" | head -20
```
Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:
```bash
find ./recipes -name "README*" -o -name "*.md" | head -10
```
Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:
```bash
find . -maxdepth 2 -name "README*" -o -name "*.md" | head -10
```
Repository: ishandhanani/srt-slurm
Length of output: 239

🏁 Script executed:
```bash
grep -r "trust-remote-code\|DeepSeek" ./docs/ ./README.md 2>/dev/null | head -20
```
Repository: ishandhanani/srt-slurm
Length of output: 551

🏁 Script executed:
```bash
grep -B5 -A5 "trust-remote-code" ./recipes/gb300-fp8/8k1k/mtp/max.yaml | head -30
```
Repository: ishandhanani/srt-slurm
Length of output: 569

🏁 Script executed:
```bash
grep -r "skip-tokenizer-init\|trust-remote" . --include="*.md" --include="*.txt" 2>/dev/null | head -10
```
Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:
```bash
cat ./docs/installation.md | grep -A10 -B10 "DeepSeek\|trust"
```
Repository: ishandhanani/srt-slurm
Length of output: 1217

🏁 Script executed:
```bash
cat ./docs/config-reference.md | head -100
```
Repository: ishandhanani/srt-slurm
Length of output: 2655

🏁 Script executed:
```bash
find . -type f \( -name "*.yaml" -o -name "*.yml" \) -path "*/recipes/*" -exec grep -l "deepseek-ai" {} \; | xargs grep -B2 -A8 "trust-remote-code" | head -40
```
Repository: ishandhanani/srt-slurm
Length of output: 2949

🏁 Script executed:
```bash
grep -r "trust.remote\|remote.code" . --include="*.md" --include="*.txt" 2>/dev/null
```
Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:
```bash
find . -name "CHANGELOG*" -o -name "SECURITY*" -o -name "SECURITY.md" | head -5
```
Repository: ishandhanani/srt-slurm
Length of output: 48
Pin model revision or document the security rationale for trust-remote-code: true.
The deepseek-ai/DeepSeek-R1 model loaded with trust-remote-code: true at lines 67-69 and 125-127 allows arbitrary code execution from the model repository without a pinned revision. If this setting is required, either:
- Pin a specific vetted commit hash (e.g., deepseek-ai/DeepSeek-R1@<commit-hash>)
- Mirror the model internally and load from a controlled source
- Document the security assessment and why the risk is acceptable

This also applies to similar configurations in other recipe files.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/max.yaml` around lines 67 - 69, The recipe sets
served-model-name "deepseek-ai/DeepSeek-R1" with trust-remote-code: true without
a pinned revision; update the configuration to either pin the model reference
(e.g., append @<commit-hash> to the served-model-name), or change the source to
an internal mirror, or add a clear security rationale comment in the recipe
explaining why trust-remote-code is acceptable and what vetting was performed;
apply the same fix for any other blocks that set trust-remote-code alongside
served-model-name in this recipe.
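One way to express the pin is the `@<commit>` suffix the prompt itself suggests; a hedged config sketch (verify that your loader actually accepts this form before committing it, and substitute a real vetted SHA for the placeholder):

```yaml
# Illustrative only: pin the remote-code snapshot to a reviewed commit.
served-model-name: "deepseek-ai/DeepSeek-R1@<commit-sha>"  # placeholder revision
skip-tokenizer-init: true
# trust-remote-code remains an opt-in to run repo code; the pin limits it to the vetted snapshot.
trust-remote-code: true
```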
```yaml
# GB300 FP8 Mid Throughput Configuration

name: "gb300-8k1k-fp8-mid"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

frontend:
  nginx_container: nginx
```
🧩 Analysis chain
🌐 Web query:
SGLang v0.5.8 cu130 GB300 FP8 support release notes
💡 Result:
SGLang v0.5.8 (Jan 23, 2026) — upstream release notes
GitHub tag v0.5.8 highlights include DeepSeek V3.2 optimizations (incl. FP8 KV cache mention), FlashAttention 4 decoding kernels, SGLang-Diffusion updates, and dependency bumps (e.g., sgl-kernel → 0.3.21). [1]
CUDA 13.0 (“cu130”) + GB300 + FP8: release notes / support notes
- NVIDIA’s SGLang container release notes (CUDA 13.0.x-based) explicitly list CUDA 13.0 compatibility, GB300/B300 support, and FP8 precision support (Hopper and above). [2]
- SGLang installation docs note that for CUDA 13 / B300 / GB300, Docker is recommended, and alternatively you can install the matching `sgl_kernel` cu130 wheel from the sgl-project wheel releases. [3]
- NVIDIA forum post also points to the cu130 kernel wheel index for SGLang. [4]
Sources:
[1] (github.com)
[2] (docs.nvidia.com)
[3] (docs.sglang.io)
[4] (forums.developer.nvidia.com)
Citations:
- 1: https://github.com/sgl-project/sglang/releases
- 2: https://docs.nvidia.com/deeplearning/frameworks/sglang-release-notes/rel-25-11.html?utm_source=openai
- 3: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 4: https://forums.developer.nvidia.com/t/run-sglang-in-thor/348815/3?utm_source=openai
🌐 Web query:
SGLang GB300 support when added
💡 Result:
SGLang’s GB300/B300 support is called out as added in NVIDIA’s SGLang container Release 25.11 (“GB300/B300 support” is listed under Key Features and Enhancements). The NVIDIA release note page was last updated Jan 30, 2026. [1]
SGLang’s own install docs also include a troubleshooting note specifically for B300/GB300 (“ptxas fatal… sm_103a…”) indicating operational support in current documentation. [2]
Sources: [1] [2]
🌐 Web query:
lmsysorg sglang v0.5.8 cu130 runtime container
💡 Result:
I couldn’t find evidence that a `lmsysorg/sglang:v0.5.8-cu130-runtime` image tag exists on Docker Hub (at least publicly indexed). The CUDA 13.0 “runtime” images that are referenced/available are:
- Stable alias (recommended for CUDA 13 setups): `lmsysorg/sglang:latest-cu130-runtime` (docs.sglang.io)
- A known versioned CUDA 13 runtime tag: `lmsysorg/sglang:v0.5.7-cu130-runtime` (docker.aityp.com)

Commands:
```bash
docker pull lmsysorg/sglang:latest-cu130-runtime
# or
docker pull lmsysorg/sglang:v0.5.7-cu130-runtime
```
Citations:
- 1: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 2: https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime?utm_source=openai
The SGLang container tag v0.5.8-cu130-runtime does not appear to exist; use latest-cu130-runtime or verify this is a private/internal image.
GB300 and FP8 support are confirmed in NVIDIA SGLang Release 25.11 with CUDA 13.0, but the specified versioned tag v0.5.8-cu130-runtime was not found on public Docker Hub. Available alternatives are latest-cu130-runtime (stable, recommended) or v0.5.7-cu130-runtime. Confirm whether the v0.5.8 tag is available in your private registry or switch to an existing public tag to avoid deployment failure.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml` around lines 1 - 12, The container tag
specified in model.container ("lmsysorg/sglang:v0.5.8-cu130-runtime") appears
invalid; update the tag to a known-public image (e.g.,
"lmsysorg/sglang:latest-cu130-runtime" or
"lmsysorg/sglang:v0.5.7-cu130-runtime") or confirm and document that
"v0.5.8-cu130-runtime" exists in your private registry; change the
model.container value accordingly and ensure any deployment/configuration docs
reference the chosen tag.
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's locate and examine the YAML file
find . -type f -name "mid.yaml" | head -5
```
Repository: ishandhanani/srt-slurm
Length of output: 255

🏁 Script executed:
```bash
# Also search for the file path pattern
find . -type f -path "*gb300-fp8*" -name "*.yaml" | head -10
```
Repository: ishandhanani/srt-slurm
Length of output: 477
🌐 Web query:
Hugging Face trust_remote_code security risks and best practices
💡 Result:
What trust_remote_code=True does (and why it’s risky)
- It allows executing Python code pulled from a Hugging Face Hub repo (e.g., custom `modeling_*.py`, `tokenization_*.py`, `pipeline.py`) when loading with Transformers “Auto” classes (and similarly enables execution of dataset loading scripts in `datasets`). This can lead to arbitrary code execution (RCE) if the repo is malicious or compromised. [1]
- Even without remote code, loading pickled weight formats (common in PyTorch `.bin` / some checkpoints) can also enable code execution during deserialization. Hugging Face explicitly warns that pickle can be abused for arbitrary code execution. [2]

Threat model / concrete attack paths
- Malicious custom modeling/pipeline code runs at import/load time once you opt in with `trust_remote_code=True`. [1]
- Repo updates as a supply-chain risk: if you load by branch name (e.g., default `main`) and don’t pin a commit, a later update could introduce malicious code. Hugging Face recommends pinning a revision to protect against repo updates. [1]
- Pickle-based exploits in weight files can bypass expectations; HF provides scanning and warnings, but also states their approach is not foolproof. [2]
- Platform reality: the Hub scans files for malware on every commit, but that’s not a guarantee of safety. [3]

Best practices (practical checklist)
- Don’t use `trust_remote_code=True` unless you must. Prefer models that work with built-in Transformers architectures. [1]
- If you must use it, review the repo code first (the exact Python files that will be imported/executed). HF explicitly recommends verifying modeling files before enabling it. [1]
- Pin to an immutable revision (commit hash or tag) when using remote code, to avoid silently pulling new code later. [1]
  - Example (conceptual): load with `revision="<commit_sha>"` and keep that SHA under change control.
- Prefer `safetensors` and enforce it:
  - In Transformers, use `use_safetensors=True` to avoid unsafe weight formats (and fail if only unsafe weights exist). [1]
  - More generally, prefer safetensors over pickle-based weights. [2]
- Treat Hub scanning as defense-in-depth, not trust:
  - HF runs malware scanning on repository files at each commit. [3]
  - HF also performs pickle-related scanning/inspection and highlights suspicious imports, but it’s not a complete guarantee. [2]
  - Consider additional scanning (HF integrates third-party scanning like Protect AI). [4]
- Sandbox execution when evaluating untrusted/community repos: run in an isolated container/VM, least-privilege user, read-only mounts where possible, and restrict outbound network/credentials (because remote code runs as you).
- For datasets: avoid executing dataset loading scripts unless necessary; `datasets` introduced `trust_remote_code` for custom dataset code and provides an environment variable to disable trusting remote code by default (`HF_DATASETS_TRUST_REMOTE_CODE=0`). [5]

Key takeaway
trust_remote_code=True is an explicit opt-in to run third-party Python from a model/dataset repo, so handle it like any other supply-chain dependency: review, pin, sandbox, and avoid pickle where possible. [1][2]

Sources:
[1] Transformers security policy (remote code + pin revision + safetensors guidance) (github.com)
[2] Hugging Face Hub “Pickle Scanning” (pickle RCE risk + mitigations + not foolproof) (huggingface.co)
[3] Hugging Face Hub “Malware Scanning” (scans every file each commit) (huggingface.co)
[4] Hugging Face Hub “Third-party scanner: Protect AI” (huggingface.co)
[5] datasets release notes describing trust_remote_code and HF_DATASETS_TRUST_REMOTE_CODE=0 (newreleases.io)

Citations:
- 1: https://github.com/huggingface/transformers/security/policy?utm_source=openai
- 2: https://huggingface.co/docs/hub/security-pickle?utm_source=openai
- 3: https://huggingface.co/docs/hub/security-malware?utm_source=openai
- 4: https://huggingface.co/docs/hub/en/security-protectai?utm_source=openai
- 5: https://newreleases.io/project/github/huggingface/datasets/release/2.16.0?utm_source=openai
🏁 Script executed:
```bash
# Read the YAML file to see the context around lines 67-70 and 125-127
cat -n ./recipes/gb300-fp8/8k1k/mtp/mid.yaml
```
Repository: ishandhanani/srt-slurm
Length of output: 6651
Pin the model revision or mirror internally; trust-remote-code: true without a pinned commit hash enables arbitrary code execution.
The deepseek-ai/DeepSeek-R1 reference (lines 67, 125) loads from the latest repository state; any subsequent update to the model repo could execute malicious code. Specify a commit hash or tag (e.g., "deepseek-ai/DeepSeek-R1@<commit_sha>") to prevent silent code execution, mirror the model internally if supported by your loader, or document the rationale and any additional security controls (e.g., sandboxing, code review process).
This applies to both prefill (lines 67–70) and decode (lines 125–127) sections.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml` around lines 67 - 70, The recipe
references the remote repo "deepseek-ai/DeepSeek-R1" with trust-remote-code:
true which allows arbitrary remote code execution; update both occurrences of
the served model reference in the prefill and decode sections (the
"served-model-name" fields) to pin to a specific commit or tag (e.g.,
"deepseek-ai/DeepSeek-R1@<commit_sha>") or switch to an internally mirrored
model ID, and if you intentionally must keep an unpinned reference, add a short
documented rationale and note any compensating controls (sandboxing, code review
process) adjacent to the served-model-name and trust-remote-code settings so
reviewers can verify the risk mitigation.
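A pinned reference in the `model@revision` form the prompts describe can be checked mechanically; a sketch (the helper and the `@` convention are illustrative here, not an SGLang API):

```python
def split_pinned_ref(ref: str):
    """Split 'org/model@revision' into (model_id, revision); revision is None when unpinned."""
    model_id, sep, revision = ref.partition("@")
    return model_id, (revision if sep else None)

print(split_pinned_ref("deepseek-ai/DeepSeek-R1"))         # ('deepseek-ai/DeepSeek-R1', None)
print(split_pinned_ref("deepseek-ai/DeepSeek-R1@abc123"))  # ('deepseek-ai/DeepSeek-R1', 'abc123')
```

A recipe lint could then fail any config that combines trust-remote-code: true with a None revision.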
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml`:
- Around line 77-81: The recipe sets context-length to 10000 but
max-total-tokens to 8192, which risks rejecting/truncating full-length requests;
update either max-total-tokens to at least 10000 (or higher to allow headroom)
or reduce context-length to <=8192 so they align—modify the values for
context-length and/or max-total-tokens in this config (look for the entries
named "context-length" and "max-total-tokens") to ensure max-total-tokens >=
context-length.
In `@recipes/gb200-fp8/8k1k/low-latency-mtp.yaml`:
- Around line 1-12: The recipe name's node-count suffix ("name":
"gb200-fp8-8k1k-1p-1d-low-latency-mtp") does not match the declared resources
(decode_nodes: 2); either update the name to reflect 1p-2d (change "1p-1d" to
"1p-2d") or change decode_nodes to 1 so the "name" and the resources align; edit
the top-level name field or the decode_nodes field accordingly to resolve the
mismatch.
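The name-vs-resources mismatch amounts to keeping the `<N>p-<M>d` token in the recipe name in sync with the resources block; a sketch of that lint, assuming the naming convention seen in these recipes (the helper is ours):

```python
import re

def name_matches_nodes(name: str, prefill_nodes: int, decode_nodes: int) -> bool:
    """True when an embedded '<N>p-<M>d' token agrees with the declared node counts."""
    m = re.search(r"(\d+)p-(\d+)d", name)
    if m is None:
        return True  # name does not encode node counts; nothing to check
    return (int(m.group(1)), int(m.group(2))) == (prefill_nodes, decode_nodes)

print(name_matches_nodes("gb200-fp8-8k1k-1p-1d-low-latency-mtp", 1, 2))  # False (the flagged mismatch)
print(name_matches_nodes("gb200-fp8-8k1k-1p-2d-low-latency-mtp", 1, 2))  # True
```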
🧹 Nitpick comments (2)
recipes/gb200-fp8/8k1k/low-latency-mtp.yaml (1)
16-58: Consider YAML anchors to reduce env duplication. Prefill and decode environments are nearly identical. Anchors make it harder for them to drift.

♻️ Example anchor pattern

```diff
 backend:
-  prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+  common_environment: &common_env
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    # ...keep the remaining shared keys here...
+  prefill_environment:
+    <<: *common_env
     SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
     # ...
-  decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+  decode_environment:
+    <<: *common_env
     SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
     SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
     # ...
```

recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml (1)
3-3: Align recipe `name` with the filename/tier for discoverability. The file path says “mid-curve-5p1d”, but the name reads “mid-tpt”. Consider renaming for clarity.
```yaml
context-length: 10000
disaggregation-mode: "prefill"
mem-fraction-static: 0.95
max-total-tokens: 8192
chunked-prefill-size: 8192
```
max-total-tokens is lower than context-length.
This can cause full-length requests to be rejected or truncated. Please align these values (either raise max-total-tokens or reduce context-length).
🐛 Example fix (raise token cap)

```diff
-max-total-tokens: 8192
+max-total-tokens: 10000
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
context-length: 10000
disaggregation-mode: "prefill"
mem-fraction-static: 0.95
max-total-tokens: 10000
chunked-prefill-size: 8192
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml` around lines 77 - 81, The recipe
sets context-length to 10000 but max-total-tokens to 8192, which risks
rejecting/truncating full-length requests; update either max-total-tokens to at
least 10000 (or higher to allow headroom) or reduce context-length to <=8192 so
they align—modify the values for context-length and/or max-total-tokens in this
config (look for the entries named "context-length" and "max-total-tokens") to
ensure max-total-tokens >= context-length.
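The constraint this prompt states (max-total-tokens >= context-length) is easy to assert in a recipe lint; a sketch using the recipe key names (the checker itself is hypothetical, and the rule is the review's claim rather than a documented SGLang invariant):

```python
def check_token_budget(cfg: dict) -> list:
    """Flag configs whose max-total-tokens cannot hold a full context-length request."""
    problems = []
    ctx = cfg.get("context-length")
    cap = cfg.get("max-total-tokens")
    if ctx is not None and cap is not None and cap < ctx:
        problems.append(f"max-total-tokens ({cap}) < context-length ({ctx})")
    return problems

# The flagged 1k8k low-latency recipe: a 10000-token context vs an 8192-token cap.
print(check_token_budget({"context-length": 10000, "max-total-tokens": 8192}))
# ['max-total-tokens (8192) < context-length (10000)']
```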
```yaml
name: "gb200-fp8-8k1k-1p-1d-low-latency-mtp"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 2
  prefill_workers: 1
```
Align the recipe name with the node counts.
The name says 1p-1d but decode_nodes: 2. Either rename the recipe to 1p-2d or change decode_nodes to 1 to avoid mislabeling in UIs/logs.
💡 Example fix (rename to match resources)

```diff
-name: "gb200-fp8-8k1k-1p-1d-low-latency-mtp"
+name: "gb200-fp8-8k1k-1p-2d-low-latency-mtp"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
name: "gb200-fp8-8k1k-1p-2d-low-latency-mtp"

model:
  path: "dsfp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 2
  prefill_workers: 1
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/8k1k/low-latency-mtp.yaml` around lines 1 - 12, The recipe
name's node-count suffix ("name": "gb200-fp8-8k1k-1p-1d-low-latency-mtp") does
not match the declared resources (decode_nodes: 2); either update the name to
reflect 1p-2d (change "1p-1d" to "1p-2d") or change decode_nodes to 1 so the
"name" and the resources align; edit the top-level name field or the
decode_nodes field accordingly to resolve the mismatch.
Force-pushed from c6da362 to 8ca02d6 (Compare)
Actionable comments posted: 11
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml`:
- Around line 1-3: The top comment incorrectly reads "Max Throughput
Configuration"; update the header comment to match the actual recipe type by
changing it to indicate "Mid-Curve Configuration" (or similar) so it aligns with
the name "gb200-fp8-mid-curve-3p1d-mtp" and the filename; ensure the descriptive
comment at the top reflects mid-curve rather than max-throughput for clarity.
In `@recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml`:
- Line 71: The YAML sets trust-remote-code: true for the DeepSeek-R1 model which
allows arbitrary remote code execution; either add a concise security
justification comment next to trust-remote-code explaining why remote code is
required for DeepSeek-R1 (e.g., necessary custom tokenizer/architecture) or pin
the model reference to a specific commit by changing the model identifier to
include a commit hash (e.g., deepseek-ai/DeepSeek-R1@<commit-hash>), or disable
the flag if not needed; update occurrences that match the same pattern (the
trust-remote-code key and the DeepSeek-R1 model reference) so all instances are
consistently justified, pinned, or disabled.
In `@recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml`:
- Around line 68-71: The tp-size, dp-size, and ep-size settings (tp-size,
dp-size, ep-size) currently multiply to 512/32,768 which exceeds the available
GPUs and will prevent launch; update the parallelism triplets at both
occurrences (the block around tp-size: 8 / dp-size: 8 / ep-size: 8 and the
similar block later) so that tp-size * dp-size * ep-size equals the actual
available GPU count for the corresponding stage (e.g., set the three values to a
combination whose product equals 40 for the prefill stage and equals 32 for the
decode stage, or otherwise scale them down to the node/GPU capacity you have).
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml`:
- Around line 13-19: The resources block currently requests 2 nodes × 4 GPUs
(gpus_per_node) but the configured parallelism (tp-size, dp-size, ep-size = 8
each for both prefill and decode) yields 512 ranks, massively oversubscribing
hardware; update either the compute allocations (increase
prefill_nodes/decode_nodes or gpus_per_node) or reduce parallelism dimensions
(tp-size, dp-size, ep-size) so total ranks <= available GPUs (prefill_nodes *
gpus_per_node * prefill_workers and decode_nodes * gpus_per_node *
decode_workers); locate and fix the resources keys (prefill_nodes,
prefill_workers, decode_nodes, decode_workers, gpus_per_node) and the
prefill/decode tp-size, dp-size, ep-size settings to make them consistent.
- Around line 67-69: The model configs enable trust-remote-code without pinning
a revision; update the sglang_config for both the prefill model block
(containing served-model-name/skip-tokenizer-init/trust-remote-code) and the
decode model block to include a revision field that passes the desired pinned
commit/tag/branch via the SGLang --revision parameter (e.g., add "revision":
"<commit-or-tag>") so remote code execution is tied to a specific model version.
In `@recipes/gb300-fp8/1k1k/mtp/mid.yaml`:
- Around line 66-68: The recipe currently sets served-model-name
("deepseek-ai/DeepSeek-R1") with trust-remote-code: true without a pinned
revision; update the served-model-name to include a fixed revision (for example
deepseek-ai/DeepSeek-R1@<commit-hash>) or point it to a locally mirrored copy to
prevent automatic unverified code updates, and apply the same change to the
decode section where the model is referenced (the block containing
skip-tokenizer-init, trust-remote-code and the decode section at lines ~124-126)
so both runtime and decoding use the pinned revision.
- Around line 12-18: The recipe oversubscribes GPUs: current resources
(prefill_nodes, prefill_workers, decode_nodes, decode_workers, gpus_per_node)
produce far fewer physical GPUs than the total training ranks implied by the
parallelism (tp, dp, ep). Fix by either lowering the parallelism dimensions
(reduce tp, dp, and/or ep values used for prefill and decode) so total_ranks ≤
gpus_per_node * nodes * gpus_per_node_per_node, or increase
prefill_nodes/decode_nodes (and/or gpus_per_node) so physical GPU count matches
total_ranks; update the prefill_* and decode_* blocks consistently (also adjust
same pattern at the other occurrences noted around lines 71-74 and 130-133) to
ensure ranks-per-GPU stays within acceptable limits.
In `@recipes/gb300-fp8/1k8k/mtp/low-latency.yaml`:
- Around line 76-83: The max-total-tokens value is too low for the 1k8k
workload: with isl=1024 and osl=8192 the total token budget is 9216 and
context-length is 9300, so update the max-total-tokens setting from 8192 to 9216
in the low-latency recipes; modify the max-total-tokens entry in the affected
YAML so it matches the 9216 total (ensure the change is applied in both the MTP
and STP low-latency files where max-total-tokens appears).
In `@recipes/gb300-fp8/1k8k/mtp/mid.yaml`:
- Around line 13-19: The parallelism settings in the resources block (gpu_type,
prefill_nodes, prefill_workers, decode_nodes, decode_workers, gpus_per_node)
imply extreme oversubscription (e.g., 512 and 110,592 ranks) so update the
configuration to match the actual GPU availability: either correct
prefill_nodes/prefill_workers and decode_nodes/decode_workers to produce the
intended total GPU counts, or adjust the tensor/model/data parallel dimensions
(tp/dp/ep) elsewhere in the recipe to equal total_gpus = nodes * gpus_per_node;
ensure tp * dp * ep does not exceed total_gpus for prefill and decode phases and
apply the same correction to the other occurrences referenced (lines ~75-78 and
~134-137).
- Around line 70-73: Add a pinned revision when using trust-remote-code: update
the model spec that contains served-model-name: "deepseek-ai/DeepSeek-R1" (and
the same block at lines ~128-131) to include a revision field (commit SHA or
tag) or point to an internal mirror instead of leaving it unpinned; ensure the
YAML entry next to served-model-name/skip-tokenizer-init/trust-remote-code
includes the equivalent of SGLang’s --revision value to lock the remote code, or
add a short comment explaining why unpinned remote code is acceptable if you
choose not to pin.
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml`:
- Around line 13-19: The configured tensor/model/data parallelism (tp/dp/ep) is
massively larger than the actual GPUs allocated in resources; compute total GPUs
as nodes * gpus_per_node * workers for prefill (prefill_nodes, prefill_workers,
gpus_per_node) and decode (decode_nodes, decode_workers, gpus_per_node) and
either reduce the tp/dp/ep values so their product (tp * dp * ep) does not
exceed total GPUs or proportionally increase nodes/workers to match the desired
parallelism; update the tp/dp/ep settings referenced around the existing
resource blocks (prefill_* and decode_*) and the other occurrences called out
(lines ~71-74 and ~131-134) so each parallelism product ≤ allocated GPUs.
🧹 Nitpick comments (1)
recipes/gb200-fp8/1k8k/low-latency-mtp.yaml (1)
25-65: JIT DeepGEMM flags are internally inconsistent.
`SGLANG_ENABLE_JIT_DEEPGEMM` is `false` but `SGLANG_JIT_DEEPGEMM_FAST_WARMUP` is `1` in both env blocks. If JIT is intentionally off, drop the warmup flag; if warmup is desired, enable JIT.

♻️ Example fix (remove warmup when JIT is off)

```diff
 prefill_environment:
   ...
-  - SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
   ...
 decode_environment:
   ...
-  - SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
```
```yaml
# GB200 FP8 Max Throughput Configuration

name: "gb200-fp8-mid-curve-3p1d-mtp"
```
Inconsistent header comment.
The comment on Line 1 says "Max Throughput Configuration" but the name and filename indicate this is a "mid-curve" configuration. This appears to be a copy-paste artifact.
Proposed fix

```diff
-# GB200 FP8 Max Throughput Configuration
+# GB200 FP8 Mid-Curve 3P1D MTP Configuration
```

📝 Committable suggestion
```yaml
# GB200 FP8 Mid-Curve 3P1D MTP Configuration

name: "gb200-fp8-mid-curve-3p1d-mtp"
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml` around lines 1 - 3, The top
comment incorrectly reads "Max Throughput Configuration"; update the header
comment to match the actual recipe type by changing it to indicate "Mid-Curve
Configuration" (or similar) so it aligns with the name
"gb200-fp8-mid-curve-3p1d-mtp" and the filename; ensure the descriptive comment
at the top reflects mid-curve rather than max-throughput for clarity.
```yaml
# Model configuration
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Scripts executed (repository: ishandhanani/srt-slurm):

```shell
cat -n recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml | head -n 150 | tail -n 80
cat -n recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml | head -n 75 | tail -n 30
cat recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml | head -n 1 && wc -l recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml
cat -n recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml
rg -i "revision|commit|ref" recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml
fd "README\|readme\|SECURITY\|security" . --type f | head -20
rg -i "deepseek" . --type yaml --type md --type txt -B2 -A2
git log --oneline --all -- recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml | head -5
```
Add security justification or pin DeepSeek-R1 model to a specific revision.
trust-remote-code: true enables arbitrary code execution from the model repository. This pattern appears throughout the codebase without explicit security justification or model revision pinning. Either document why this setting is required for DeepSeek-R1, pin to a specific commit hash (e.g., deepseek-ai/DeepSeek-R1@<commit-hash>), or disable if not essential.
Also applies to: line 129
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml` at line 71, The YAML sets
trust-remote-code: true for the DeepSeek-R1 model which allows arbitrary remote
code execution; either add a concise security justification comment next to
trust-remote-code explaining why remote code is required for DeepSeek-R1 (e.g.,
necessary custom tokenizer/architecture) or pin the model reference to a
specific commit by changing the model identifier to include a commit hash (e.g.,
deepseek-ai/DeepSeek-R1@<commit-hash>), or disable the flag if not needed;
update occurrences that match the same pattern (the trust-remote-code key and
the DeepSeek-R1 model reference) so all instances are consistently justified,
pinned, or disabled.
```yaml
# Parallelism
tp-size: 8
dp-size: 8
ep-size: 8
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
find . -name "mid-curve-5p1d-mtp.yaml" -type f
cat -n ./recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml
find ./recipes -name "*.yaml" -type f | head -20
head -150 ./recipes/gb200-fp8/8k1k/max-tpt.yaml | tail -100
ls -la ./recipes/gb200-fp8/8k1k/
cat -n ./recipes/gb200-fp8/8k1k/mid-curve-5p1d.yaml | sed -n '60,135p'
cat -n ./recipes/gb200-fp8/8k1k/low-latency.yaml | sed -n '60,135p'
find ./recipes -name "README*" -o -name "*.md" | head -10
ls -la ./ | grep -i readme
cat ./README.md
find . -name "*.py" -type f | xargs grep -l "tp-size\|tp_size\|tensor-parallel" | head -10
cat ./analysis/srtlog/config_reader.py
```
🌐 Web query:
SGLang tp-size dp-size ep-size world size semantics parallelism
💡 Result:
In SGLang, these flags describe how many GPU ranks you allocate to each parallelism dimension, and they must multiply to the total number of ranks (GPUs) you launch.

- `tp-size` (Tensor Parallel / TP): splits a single model replica's large weight matrices across `tp-size` GPUs (model-sharding within a replica). [4]
- `dp-size` (Data Parallel / DP): runs `dp-size` independent replicas of the model (each replica gets its own share of requests). [3]
- `ep-size` (Expert Parallel / EP): for MoE models, shards/places experts across `ep-size` GPUs to reduce expert-weight memory and enable MoE scaling. [1][2]
- world size: total number of distributed ranks participating in the job; in multi-node setups it's typically `nnodes × gpus_per_node`. [3]

How they relate (the key semantic rule): SGLang validates that the total GPUs/ranks used by the serving engine equal the product of the parallelism sizes:

total_gpus = tp_size × pp_size × ep_size × dp_size

(where pp_size is pipeline parallelism, if used). [3]

Important EP constraint (common gotcha): some EP backends (e.g., DeepEP/Mooncake) only support `ep_size = tp_size`; if you want hybrid EP+TP with `ep_size < tp_size`, SGLang may restrict you to other (less optimized) dispatch backends. [1]
🏁 Script executed:

```shell
grep -n "prefill_nodes\|prefill_workers\|decode_nodes\|decode_workers\|gpus_per_node" ./recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml
```
Critical GPU allocation mismatch: parallelism settings incompatible with available GPUs.
Prefill requires 512 GPUs (8×8×8) but only 40 are available (10 nodes × 4 GPUs); decode requires 32,768 GPUs (32×32×32) but only 32 are available (8 nodes × 4 GPUs). SGLang will fail to launch. Align tp-size, dp-size, and ep-size so their product equals available GPU count per stage. Same issue at lines 126-129.
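The rank-count rule quoted above lends itself to a quick sanity check. The snippet below is illustrative only (it is not srtctl code); it assumes the `total_gpus = tp × pp × ep × dp` rule from the SGLang docs and plugs in the decode-stage numbers from this recipe.

```python
# Illustrative sanity check: recipe parallelism vs. allocated GPUs.
# Assumes SGLang's rule: total ranks = tp * dp * ep * pp per worker.

def ranks_required(tp: int, dp: int, ep: int, pp: int = 1) -> int:
    """Ranks implied by the parallelism dimensions."""
    return tp * dp * ep * pp

def gpus_available(nodes: int, gpus_per_node: int) -> int:
    """Physical GPUs allocated to a stage."""
    return nodes * gpus_per_node

# Decode stage of mid-curve-5p1d-mtp: tp=dp=ep=32 on 8 nodes x 4 GPUs.
needed = ranks_required(32, 32, 32)   # 32768
have = gpus_available(8, 4)           # 32
assert needed > have                  # oversubscribed: launch would fail
```

Running the same check on the prefill stage (tp=dp=ep=8 on 10 nodes × 4 GPUs) flags 512 ranks against 40 GPUs.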
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/8k1k/mid-curve-5p1d-mtp.yaml` around lines 68 - 71, The
tp-size, dp-size, and ep-size settings (tp-size, dp-size, ep-size) currently
multiply to 512/32,768 which exceeds the available GPUs and will prevent launch;
update the parallelism triplets at both occurrences (the block around tp-size: 8
/ dp-size: 8 / ep-size: 8 and the similar block later) so that tp-size * dp-size
* ep-size equals the actual available GPU count for the corresponding stage
(e.g., set the three values to a combination whose product equals 40 for the
prefill stage and equals 32 for the decode stage, or otherwise scale them down
to the node/GPU capacity you have).
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
cat -n recipes/gb300-fp8/1k1k/mtp/max.yaml | head -80
sed -n '120,135p' recipes/gb300-fp8/1k1k/mtp/max.yaml
find recipes -name "*.yaml" -type f | head -10 | xargs grep -l "served-model-name"
grep -A2 -B2 "trust-remote-code" recipes/gb200-fp4/1k1k/max-tpt.yaml | head -20
grep -r "revision\|commit" recipes/gb300-fp8/1k1k/mtp/ || echo "No revision/commit fields found"
find . -name "README*" -o -name "*.md" | xargs grep -l "model\|revision" | head -5
grep -A10 -B5 "served-model-name" docs/config-reference.md
grep -r "HuggingFace\|model.*revision\|huggingface" docs/ --include="*.md" | head -20
rg "served-model-name|trust-remote-code" --type python | head -20
git ls-files | grep -E "\.(py|sh)$" | head -20
cat -n analysis/srtlog/config_reader.py | head -100
```
🌐 Web query:
SGLang served-model-name revision commit pinning trust-remote-code
💡 Result:
In SGLang (CLI: `python -m sglang.launch_server`):

- `--served-model-name`: overrides the model name returned by the OpenAI-compatible `GET /v1/models` endpoint. [1]
- `--revision`: selects a specific Hugging Face Hub model version; it can be a branch name, tag, or commit id (i.e., commit pinning). [1]
- `--trust-remote-code`: enables loading models that require custom code from the Hub (equivalent idea to Transformers' "trust remote code"). Default is `False`; pass the flag to enable. [1]
Example (pin to a specific commit):

```shell
python -m sglang.launch_server \
  --model-path org/model-repo \
  --revision <commit_sha_or_tag_or_branch> \
  --served-model-name my-model \
  --trust-remote-code
```

Hugging Face's Hub tooling also documents that revision can be a branch/tag/commit hash when downloading snapshots/files, which is the same concept SGLang exposes via `--revision`. [2]
Sources: [1] SGLang server arguments docs, [2] Hugging Face Hub download docs.
🏁 Scripts executed:

```shell
grep -r "revision:" recipes/ --include="*.yaml" | head -5
grep -i "revision" docs/config-reference.md
```
Add revision pinning to model configuration when using trust-remote-code: true.
Both the prefill (lines 67–69) and decode (lines 125–127) sections enable trust-remote-code without pinning the model revision. Unpinned remote code can change and execute arbitrary code. Use SGLang's --revision parameter to pin a specific commit, tag, or branch—add a revision field to the sglang_config to specify the exact model version.
Affected locations
- Lines 67–69: prefill model config
- Lines 125–127: decode model config
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml` around lines 67 - 69, The model configs
enable trust-remote-code without pinning a revision; update the sglang_config
for both the prefill model block (containing
served-model-name/skip-tokenizer-init/trust-remote-code) and the decode model
block to include a revision field that passes the desired pinned
commit/tag/branch via the SGLang --revision parameter (e.g., add "revision":
"<commit-or-tag>") so remote code execution is tied to a specific model version.
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
# Locate and examine the YAML file
fd "mid.yaml" recipes/
# Read the file to see the context around lines 66-68 and 124-126
cat -n recipes/gb300-fp8/1k1k/mtp/mid.yaml | sed -n '60,130p'
# Search for served-model-name patterns to see if pinning is used elsewhere
rg "served-model-name" --type yaml -A 1 -B 1
# Look for any documentation or examples about model pinning
fd -e md -e txt -e rst | xargs rg -l "revision|pin" 2>/dev/null | head -10
# Check if there are any other similar configurations with different patterns
rg "trust-remote-code" --type yaml -B 2 -A 2
```
Pin DeepSeek-R1 revision when trust-remote-code is enabled.
Unpinned remote code allows automatic fetching of model updates without integrity verification, introducing supply-chain RCE risk. Specify a fixed revision (e.g., deepseek-ai/DeepSeek-R1@<commit-hash>) or use a locally mirrored copy.
Also applies to: lines 124-126 (decode section)
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/mid.yaml` around lines 66 - 68, The recipe
currently sets served-model-name ("deepseek-ai/DeepSeek-R1") with
trust-remote-code: true without a pinned revision; update the served-model-name
to include a fixed revision (for example deepseek-ai/DeepSeek-R1@<commit-hash>)
or point it to a locally mirrored copy to prevent automatic unverified code
updates, and apply the same change to the decode section where the model is
referenced (the block containing skip-tokenizer-init, trust-remote-code and the
decode section at lines ~124-126) so both runtime and decoding use the pinned
revision.
```yaml
watchdog-timeout: 1000000
context-length: 9300
disaggregation-mode: "prefill"
disaggregation-transfer-backend: nixl
mem-fraction-static: 0.95
max-total-tokens: 8192
chunked-prefill-size: 8192
max-prefill-tokens: 8192
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
find . -name "low-latency.yaml" -path "*gb300-fp8*" | head -20
wc -l ./recipes/gb300-fp8/1k8k/mtp/low-latency.yaml
cat -n ./recipes/gb300-fp8/1k8k/mtp/low-latency.yaml | sed -n '70,90p'
cat -n ./recipes/gb300-fp8/1k8k/mtp/low-latency.yaml | sed -n '125,136p'
cat -n ./recipes/gb300-fp8/1k1k/mtp/low-latency.yaml | sed -n '70,90p'
cat -n ./recipes/gb300-fp8/8k1k/mtp/low-latency.yaml | sed -n '70,90p'
grep -n "max-total-tokens" ./recipes/gb300-fp8/1k8k/mtp/low-latency.yaml
cat -n ./recipes/gb300-fp8/1k8k/stp/low-latency.yaml | sed -n '70,90p'
grep -A 5 "^benchmark:" ./recipes/gb300-fp8/1k8k/stp/low-latency.yaml
```
Align max-total-tokens with the 1k8k workload in both MTP and STP variants.
The benchmark specifies isl: 1024 and osl: 8192 (total: 9216 tokens), but max-total-tokens is set to 8192. Since context-length is 9300, raise max-total-tokens to match.
🛠️ Proposed adjustment

```diff
-          max-total-tokens: 8192
+          max-total-tokens: 9300
```

Applies to both:
- `recipes/gb300-fp8/1k8k/mtp/low-latency.yaml` (line 81)
- `recipes/gb300-fp8/1k8k/stp/low-latency.yaml` (line 77)
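The token-budget arithmetic behind this comment is simple enough to check directly; a minimal sketch follows, with values taken from the benchmark stanza quoted in the review and illustrative variable names:

```python
# Token budget check for the 1k8k low-latency recipes (illustrative).
isl, osl = 1024, 8192        # benchmark input/output sequence lengths
context_length = 9300        # from the recipe
max_total_tokens = 8192      # from the recipe

budget = isl + osl           # 9216 tokens per request
assert budget == 9216
assert context_length >= budget       # context window covers the workload
assert max_total_tokens < budget      # 8192 < 9216: the flagged mismatch
```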
📝 Committable suggestion
```yaml
watchdog-timeout: 1000000
context-length: 9300
disaggregation-mode: "prefill"
disaggregation-transfer-backend: nixl
mem-fraction-static: 0.95
max-total-tokens: 9300
chunked-prefill-size: 8192
max-prefill-tokens: 8192
```
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k8k/mtp/low-latency.yaml` around lines 76 - 83, The
max-total-tokens value is too low for the 1k8k workload: with isl=1024 and
osl=8192 the total token budget is 9216 and context-length is 9300, so update
the max-total-tokens setting from 8192 to 9216 in the low-latency recipes;
modify the max-total-tokens entry in the affected YAML so it matches the 9216
total (ensure the change is applied in both the MTP and STP low-latency files
where max-total-tokens appears).
```yaml
resources:
  gpu_type: "gb300"
  prefill_nodes: 2
  prefill_workers: 1
  decode_nodes: 12
  decode_workers: 1
  gpus_per_node: 4
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
find . -name "mid.yaml" -o -name "*gb300*" -type f | head -20
git ls-files | grep -E "(recipes|gb300)" | head -30
cat -n ./recipes/gb300-fp8/1k8k/mtp/mid.yaml
```
Parallelism configuration requires extreme GPU oversubscription.
Prefill requires 512 ranks (8×8×8) on 8 GPUs; decode requires 110,592 ranks (48×48×48) on 48 GPUs. This produces 64:1 and 2,304:1 ratios respectively. Clarify whether these parallelism dimensions are correctly specified for this hardware setup, or adjust tp/dp/ep to align with available GPU count.
Also applies to: lines 75-78, 134-137
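The oversubscription ratios quoted above follow directly from the recipe numbers; a minimal arithmetic sketch:

```python
# Ratios cited in the comment (illustrative arithmetic only).
prefill_ranks = 8 * 8 * 8        # tp * dp * ep = 512
prefill_gpus = 2 * 4             # prefill_nodes * gpus_per_node = 8
decode_ranks = 48 * 48 * 48      # 110592
decode_gpus = 12 * 4             # 48

assert prefill_ranks // prefill_gpus == 64       # 64:1
assert decode_ranks // decode_gpus == 2304       # 2304:1
```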
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k8k/mtp/mid.yaml` around lines 13 - 19, The parallelism
settings in the resources block (gpu_type, prefill_nodes, prefill_workers,
decode_nodes, decode_workers, gpus_per_node) imply extreme oversubscription
(e.g., 512 and 110,592 ranks) so update the configuration to match the actual
GPU availability: either correct prefill_nodes/prefill_workers and
decode_nodes/decode_workers to produce the intended total GPU counts, or adjust
the tensor/model/data parallel dimensions (tp/dp/ep) elsewhere in the recipe to
equal total_gpus = nodes * gpus_per_node; ensure tp * dp * ep does not exceed
total_gpus for prefill and decode phases and apply the same correction to the
other occurrences referenced (lines ~75-78 and ~134-137).
```yaml
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
fd -e yaml recipes/gb300-fp8/1k8k/mtp/mid.yaml
cat -n recipes/gb300-fp8/1k8k/mtp/mid.yaml | sed -n '60,80p'
cat -n recipes/gb300-fp8/1k8k/mtp/mid.yaml | sed -n '120,140p'
rg "model.*revision\|revision.*:" recipes/gb300-fp8 -B 2 -A 2
grep -r "served-model-name" recipes/ | head -20
```
🌐 Web query:
sglang served-model-name revision parameter trust-remote-code
💡 Result:
In SGLang (`python -m sglang.launch_server ...`), these CLI flags mean:

- `--trust-remote-code`: allows loading models that ship custom Python modeling code from the model hub (disabled by default; pass the flag to enable). [1]
- `--revision <rev>`: pins the model to a specific Hub version: branch name, tag, or commit id (defaults to the repo's default revision if omitted). [1]
- `--served-model-name <name>`: overrides the model name returned by the OpenAI-compatible `v1/models` endpoint (useful if you want clients to see a stable/alias name instead of the repo ID). [1]
Example:

```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --revision main \
  --trust-remote-code \
  --served-model-name my-qwen-prod \
  --host 0.0.0.0 --port 30000
```

Sources: [1] [2]
Pin the model revision when using trust-remote-code.
Unpinned remote code can change and execute arbitrarily. SGLang supports --revision parameter to pin a commit/tag. Add revision specification, use an internal mirror, or document why this is acceptable.
Also applies to lines 128-131.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k8k/mtp/mid.yaml` around lines 70 - 73, Add a pinned
revision when using trust-remote-code: update the model spec that contains
served-model-name: "deepseek-ai/DeepSeek-R1" (and the same block at lines
~128-131) to include a revision field (commit SHA or tag) or point to an
internal mirror instead of leaving it unpinned; ensure the YAML entry next to
served-model-name/skip-tokenizer-init/trust-remote-code includes the equivalent
of SGLang’s --revision value to lock the remote code, or add a short comment
explaining why unpinned remote code is acceptable if you choose not to pin.
```yaml
resources:
  gpu_type: "gb300"
  prefill_nodes: 10
  prefill_workers: 5
  decode_nodes: 8
  decode_workers: 1
  gpus_per_node: 4
```
🧩 Analysis chain
🏁 Script executed:
# First, find and read the configuration file
fd "mid.yaml" recipes/Repository: ishandhanani/srt-slurm
Length of output: 281
🏁 Script executed:
# Read the complete configuration file to understand all parameters
cat recipes/gb300-fp8/8k1k/mtp/mid.yamlRepository: ishandhanani/srt-slurm
Length of output: 5342
🏁 Script executed:
# Search for tp/dp/ep parallelism configuration in related files
rg "tp.*dp.*ep|tensor_parallel|data_parallel|expert_parallel" recipes/gb300-fp8/8k1k/mtp/ -A 2 -B 2Repository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Check the other sections mentioned (lines 71-74, 131-134)
sed -n '71,74p; 131,134p' recipes/gb300-fp8/8k1k/mtp/mid.yamlRepository: ishandhanani/srt-slurm
Length of output: 226
🏁 Script executed:
# Compare with other similar config files to understand the pattern
cat recipes/gb300-fp8/8k1k/stp/mid.yaml | grep -A 10 "resources:"Repository: ishandhanani/srt-slurm
Length of output: 250
🏁 Script executed:
# Check if there are any configuration documents or READMEs explaining resource allocation
find recipes/ -name "README*" -o -name "*.md" | head -5Repository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Search for how prefill_workers and decode_workers are used in the codebase
rg "prefill_workers|decode_workers" --type yaml --type py -B 2 -A 2Repository: ishandhanani/srt-slurm
Length of output: 50379
🏁 Script executed:
# Get line count to ensure safe extraction
wc -l recipes/gb300-fp8/8k1k/mtp/mid.yamlRepository: ishandhanani/srt-slurm
Length of output: 105
🏁 Script executed:
# Check lines 71-74 and 131-134
sed -n '65,80p; 125,140p' recipes/gb300-fp8/8k1k/mtp/mid.yamlRepository: ishandhanani/srt-slurm
Length of output: 828
🏁 Script executed:
# Look at the schema.py to understand GPU calculation logic
sed -n '1,100p' src/srtctl/core/schema.py | grep -A 20 "gpus_per_prefill\|gpus_per_decode"Repository: ishandhanani/srt-slurm
Length of output: 48
🏁 Script executed:
# Get more context around the parallelism sections
sed -n '65,85p; 125,145p' recipes/gb300-fp8/8k1k/mtp/mid.yaml
Length of output: 1047
🏁 Script executed:
# Check if there are any comments or validation logic about parallelism
grep -A 5 -B 5 "gpus_per_prefill\|gpus_per_decode" src/srtctl/core/schema.py | head -40
Length of output: 1333
🏁 Script executed:
# Check similar config files to see if this pattern is common
grep -l "prefill_workers: 5" recipes/gb300-fp8/*/*.yaml
grep -l "tp-size: 32" recipes/gb300-fp8/*/*.yaml
Length of output: 185
🏁 Script executed:
# Find configs with prefill_workers: 5
find recipes/gb300-fp8 -name "*.yaml" -exec grep -l "prefill_workers: 5" {} \;
Length of output: 137
🏁 Script executed:
# Find configs with tp-size: 32
find recipes/gb300-fp8 -name "*.yaml" -exec grep -l "tp-size: 32" {} \;
Length of output: 281
🏁 Script executed:
# Let's examine what the actual total GPUs are for each stage
python3 << 'EOF'
# Prefill: 10 nodes * 5 workers * 4 GPUs/node = ?
prefill_total_gpus = 10 * 4 # Total for prefill stage (not multiplied by workers)
prefill_workers = 5
gpus_per_prefill_worker = (10 * 4) // 5
# Decode: 8 nodes * 1 worker * 4 GPUs/node = ?
decode_total_gpus = 8 * 4 # Total for decode stage (not multiplied by workers)
decode_workers = 1
gpus_per_decode_worker = (8 * 4) // 1
# Parallelism
prefill_tp = 8
prefill_dp = 8
prefill_ep = 8
prefill_world_size = prefill_tp * prefill_dp * prefill_ep
decode_tp = 32
decode_dp = 32
decode_ep = 32
decode_world_size = decode_tp * decode_dp * decode_ep
print(f"Prefill: {prefill_total_gpus} total GPUs, {prefill_workers} workers, {gpus_per_prefill_worker} GPUs/worker")
print(f"Prefill parallelism: tp={prefill_tp}, dp={prefill_dp}, ep={prefill_ep}, world_size={prefill_world_size}")
print(f"Prefill oversubscription: {prefill_world_size} ranks / {gpus_per_prefill_worker} GPUs per worker = {prefill_world_size / gpus_per_prefill_worker:.1f}x")
print()
print(f"Decode: {decode_total_gpus} total GPUs, {decode_workers} workers, {gpus_per_decode_worker} GPUs/worker")
print(f"Decode parallelism: tp={decode_tp}, dp={decode_dp}, ep={decode_ep}, world_size={decode_world_size}")
print(f"Decode oversubscription: {decode_world_size} ranks / {gpus_per_decode_worker} GPUs per worker = {decode_world_size / gpus_per_decode_worker:.1f}x")
EOF
Length of output: 408
🏁 Script executed:
# Check the STP version of the same config to compare
cat recipes/gb300-fp8/8k1k/stp/mid.yaml | grep -A 20 "resources:"
Length of output: 664
🏁 Script executed:
# Look for any comments or docs about disaggregation and parallelism
grep -r "disaggregation" recipes/gb300-fp8/8k1k/mtp/mid.yaml -i
Length of output: 640
🏁 Script executed:
# Check if there are validation warnings or known issues in test files
grep -r "oversubscription\|world_size\|512\|32768" tests/ --include="*.py" -B 2 -A 2 | head -40
Length of output: 585
Align tp/dp/ep parallelism with allocated GPU resources.
Prefill allocates 40 GPUs (10 nodes × 4 gpus/node, 5 workers = 8 GPUs/worker) for 512 parallelism ranks (tp=8, dp=8, ep=8), creating 64x oversubscription. Decode allocates 32 GPUs (8 nodes × 4 gpus/node, 1 worker) for 32,768 ranks (tp=32, dp=32, ep=32), creating 1024x oversubscription. Reduce tp/dp/ep values or increase GPU allocation to match the parallelism requirements.
Also applies to: 71-74, 131-134
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/8k1k/mtp/mid.yaml` around lines 13 - 19, The configured
tensor/model/data parallelism (tp/dp/ep) is massively larger than the actual
GPUs allocated in resources; compute total GPUs as nodes * gpus_per_node *
workers for prefill (prefill_nodes, prefill_workers, gpus_per_node) and decode
(decode_nodes, decode_workers, gpus_per_node) and either reduce the tp/dp/ep
values so their product (tp * dp * ep) does not exceed total GPUs or
proportionally increase nodes/workers to match the desired parallelism; update
the tp/dp/ep settings referenced around the existing resource blocks (prefill_*
and decode_*) and the other occurrences called out (lines ~71-74 and ~131-134)
so each parallelism product ≤ allocated GPUs.
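The resource arithmetic the review walks through can be sketched as a small helper. This is illustrative only — the field names mirror the recipe keys, but the function is not part of srtctl:

```python
# Hypothetical sanity check for the resource/parallelism mismatch described
# above. Inputs mirror the recipe's resources block; the helper itself is a
# sketch, not srtctl code.

def check_stage(nodes: int, gpus_per_node: int, workers: int,
                tp: int, dp: int, ep: int) -> tuple[int, int]:
    """Return (gpus_per_worker, world_size) for one stage."""
    gpus_per_worker = (nodes * gpus_per_node) // workers
    world_size = tp * dp * ep  # ranks implied by the parallelism settings
    return gpus_per_worker, world_size

# Numbers quoted in the review: prefill = 10 nodes x 4 GPUs, 5 workers,
# tp=dp=ep=8 -> 8 GPUs/worker vs 512 ranks (64x oversubscription).
gpus, ranks = check_stage(10, 4, 5, 8, 8, 8)
print(gpus, ranks, ranks // gpus)
```

With the decode numbers (8 nodes, 4 GPUs, 1 worker, tp=dp=ep=32) the same helper yields 32 GPUs against 32,768 ranks, matching the 1024x figure in the comment.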
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml`:
- Line 5: The Docker image tag for SGLang is invalid: replace the container
value "lmsysorg/sglang:v0.5.8-cu130" with the CUDA13 runtime-suffixed tag (e.g.,
"lmsysorg/sglang:v0.5.8-cu130-runtime"); if v0.5.8-cu130-runtime is not
published, use the confirmed available tag
"lmsysorg/sglang:v0.5.7-cu130-runtime" instead and verify against the official
SGLang image tags; update the container field accordingly.
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml`:
- Around line 153-156: The decode section is missing the max-total-tokens
parameter; add a line under the decode block setting max-total-tokens: 524288
(to match the prefill section's max-total-tokens) so the decode block explicitly
defines the same total-token limit (compare with the prefill max-total-tokens
and the other recipe max-tpt-2p1d-mtp.yaml if you intend a different value).
In `@recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml`:
- Around line 45-64: The decode_environment block is missing the
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM flag alongside the existing
SGLANG_DECODE_BOOTSTRAP_TIMEOUT; add SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" to the
decode_environment entries (or add a comment explaining why it's intentionally
omitted for the 1k8k profile) so the behavior matches the other profile that
includes both SGLANG_DECODE_BOOTSTRAP_TIMEOUT and
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM.
  model:
    path: "dsfp8"
-   container: "0.5.8"
+   container: "lmsysorg/sglang:v0.5.8-cu130"
🧩 Analysis chain
🌐 Web query:
SGLang v0.5.8 Docker image tags lmsysorg CUDA 13
💡 Result:
I couldn’t find any published lmsysorg/sglang Docker tag for v0.5.8 that targets CUDA 13 (e.g., no evidence of a v0.5.8-cu13* / v0.5.8-cu130* tag in the public listings I could access).
What is documented/visible for CUDA 13 with lmsysorg/sglang:
- Nightly / dev (CUDA 13): lmsysorg/sglang:dev-cu13 (docs.sglang.io)
- Stable (CUDA 13.0 runtime line): lmsysorg/sglang:latest-cu130-runtime (docs.sglang.io)
- Versioned CUDA 13.0 runtime tag I can confirm exists (latest I saw): lmsysorg/sglang:v0.5.7-cu130-runtime (docker.aityp.com)
If you want, tell me whether you need build image vs -runtime, and I can try to enumerate any additional *-cu13 / *-cu130-* tags that are currently published.
Citations:
- 1: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 2: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 3: https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime?utm_source=openai
🌐 Web query:
Docker Hub lmsysorg/sglang v0.5.8 available tags cu130 runtime
💡 Result:
I couldn’t find any evidence that a v0.5.8-*-cu130-runtime tag exists for lmsysorg/sglang on Docker Hub (no search hits for v0.5.8-cu130-runtime, and the Docker Hub repo page doesn’t surface any v0.5.8 tags). [1]
What is referenced/available for CUDA 13.0 runtime:
- Stable (recommended) CUDA 13 runtime tag: lmsysorg/sglang:latest-cu130-runtime (called out in the official install docs for “CUDA 13 environment”). [2]
- Latest versioned CUDA 13 runtime tag I could find: lmsysorg/sglang:v0.5.7-cu130-runtime (listed by a Docker Hub mirror page that tracks that exact tag). [3]
Sources:
[1] Docker Hub repo page for lmsysorg/sglang (hub.docker.com)
[2] SGLang install docs mentioning latest-cu130-runtime (docs.sglang.io)
[3] Mirror page showing v0.5.7-cu130-runtime exists (docker.aityp.com)
Citations:
- 1: https://hub.docker.com/r/lmsysorg/sglang/?utm_source=openai
- 2: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 3: https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime?utm_source=openai
🌐 Web query:
SGLang CUDA 13 Docker image runtime variant "-runtime" suffix
💡 Result:
In SGLang’s official Docker Hub images (lmsysorg/sglang), the -runtime suffix means the production “runtime” variant of an image tag (e.g., latest-runtime) that is smaller (~40%) because it excludes build tools and development dependencies. [1]
For CUDA 13, SGLang recommends using either: [1]
- Nightly/dev CUDA 13 image: lmsysorg/sglang:dev-cu13
- Stable CUDA 13 runtime image: lmsysorg/sglang:latest-cu130-runtime (note cu130 + -runtime)
Related naming in NVIDIA’s Dynamo artifacts also uses a “runtime” image plus a CUDA 13 tag (e.g., sglang-runtime:<version>-cuda13). [2]
Update the SGLang Docker image tag for CUDA 13 with the -runtime suffix.
The tag lmsysorg/sglang:v0.5.8-cu130 does not exist in published repositories. For CUDA 13, SGLang images require the -runtime suffix (e.g., lmsysorg/sglang:v0.5.8-cu130-runtime). The latest confirmed versioned CUDA 13 runtime tag is lmsysorg/sglang:v0.5.7-cu130-runtime. Verify whether v0.5.8 with CUDA 13 runtime has been released; if not, use the v0.5.7 tag or check the official SGLang documentation for the recommended CUDA 13 image.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml` at line 5, The Docker image tag
for SGLang is invalid: replace the container value
"lmsysorg/sglang:v0.5.8-cu130" with the CUDA13 runtime-suffixed tag (e.g.,
"lmsysorg/sglang:v0.5.8-cu130-runtime"); if v0.5.8-cu130-runtime is not
published, use the confirmed available tag
"lmsysorg/sglang:v0.5.7-cu130-runtime" instead and verify against the official
SGLang image tags; update the container field accordingly.
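The tag rewrite the fix prompt describes is mechanical enough to sketch. The `-runtime` suffix rule is an assumption taken from the tags discussed above, not an official naming convention, and the helper is hypothetical:

```python
# Illustrative helper: rewrite a CUDA-13 SGLang tag to the -runtime variant
# the review recommends. The suffix rule is an assumption based on the tags
# quoted above, not a documented Docker Hub naming scheme.

def runtime_tag(image: str) -> str:
    repo, _, tag = image.rpartition(":")
    if tag.endswith("-cu130"):  # bare cu130 tags get the runtime suffix
        tag += "-runtime"
    return f"{repo}:{tag}"

print(runtime_tag("lmsysorg/sglang:v0.5.8-cu130"))
# lmsysorg/sglang:v0.5.8-cu130-runtime
```

Tags that already carry `-runtime` (e.g. the confirmed `v0.5.7-cu130-runtime`) pass through unchanged.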
  # Memory and token limits
  mem-fraction-static: 0.75
  chunked-prefill-size: 36864
Missing max-total-tokens in decode section.
The prefill section specifies max-total-tokens: 524288 (line 96), but the decode section lacks this parameter. The comparable max-tpt-2p1d-mtp.yaml has max-total-tokens: 1703116 for decode. If this omission is intentional (relying on a default), consider adding a comment; otherwise, add the appropriate value.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml` around lines 153 - 156, The
decode section is missing the max-total-tokens parameter; add a line under the
decode block setting max-total-tokens: 524288 (to match the prefill section's
max-total-tokens) so the decode block explicitly defines the same total-token
limit (compare with the prefill max-total-tokens and the other recipe
max-tpt-2p1d-mtp.yaml if you intend a different value).
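The parity check behind this comment can be expressed over an already-parsed recipe. Keys mirror the YAML above; the function is a sketch, not srtctl validation code:

```python
# Sketch of the prefill/decode parity check the comment asks for, operating
# on a recipe that has already been parsed into nested dicts.

def missing_in_decode(recipe: dict, key: str = "max-total-tokens"):
    """Return prefill's value when decode omits the same key, else None."""
    prefill = recipe.get("prefill", {})
    decode = recipe.get("decode", {})
    if key in prefill and key not in decode:
        return prefill[key]  # candidate value for decode to adopt
    return None

recipe = {"prefill": {"max-total-tokens": 524288},
          "decode": {"mem-fraction-static": 0.75}}
print(missing_in_decode(recipe))  # 524288
```

A CI lint built on this would flag exactly the omission described here before the recipe reaches a cluster.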
# Decode-specific environment variables
decode_environment:
  TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
  DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
  SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
  MC_TE_METRIC: "true"
  SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
  SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
  SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
  SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
  SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
  MC_FORCE_MNNVL: "1"
  NCCL_MNNVL_ENABLE: "1"
  NCCL_CUMEM_ENABLE: "1"
  SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
  SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
  PYTHONUNBUFFERED: "1"
  SGLANG_ENABLE_SPEC_V2: "1"
  SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH: "1"
  SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
Missing SGLANG_HACK_SEQ_BOOTSTRAP_ROOM in decode environment.
Comparing to max-tpt-2p1d-mtp.yaml, the decode environment there includes SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" (alongside SGLANG_DECODE_BOOTSTRAP_TIMEOUT). This file has the timeout but lacks the hack flag. If this is intentional for the 1k8k profile, consider adding a comment; otherwise, add it for consistency.
Proposed fix if needed
SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
+ SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
  SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max-tpt-mtp.yaml` around lines 45 - 64, The
decode_environment block is missing the SGLANG_HACK_SEQ_BOOTSTRAP_ROOM flag
alongside the existing SGLANG_DECODE_BOOTSTRAP_TIMEOUT; add
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" to the decode_environment entries (or add a
comment explaining why it's intentionally omitted for the 1k8k profile) so the
behavior matches the other profile that includes both
SGLANG_DECODE_BOOTSTRAP_TIMEOUT and SGLANG_HACK_SEQ_BOOTSTRAP_ROOM.
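Flag drift like this between profiles is easy to surface with a set difference over the parsed env blocks. The dicts below are stand-ins for the YAML, and the helper is illustrative:

```python
# A minimal way to surface flags (e.g. SGLANG_HACK_SEQ_BOOTSTRAP_ROOM) that
# one profile's decode_environment sets and another omits. The env dicts are
# stand-ins for the parsed YAML blocks.

def env_diff(reference: dict, candidate: dict) -> set:
    """Keys present in reference but absent from candidate."""
    return set(reference) - set(candidate)

ref = {"SGLANG_DECODE_BOOTSTRAP_TIMEOUT": "1000",
       "SGLANG_HACK_SEQ_BOOTSTRAP_ROOM": "1"}
cand = {"SGLANG_DECODE_BOOTSTRAP_TIMEOUT": "1000"}
print(env_diff(ref, cand))  # {'SGLANG_HACK_SEQ_BOOTSTRAP_ROOM'}
```

Running this across all decode profiles would catch intentional omissions only if they are whitelisted, which doubles as the documentation the review asks for.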
6230ad0 to d6f43da (Compare)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
recipes/gb200-fp8/1k1k/low-latency-mtp.yaml (1)
24-24: ⚠️ Potential issue | 🟡 Minor
Contradictory JIT DeepGEMM configuration.
SGLANG_ENABLE_JIT_DEEPGEMM is set to "false" (line 24), but SGLANG_JIT_DEEPGEMM_FAST_WARMUP is set to "1" (line 38). Enabling fast warmup for a disabled feature is contradictory and may indicate a configuration oversight.
Please clarify the intent: if JIT DeepGEMM should be used, set SGLANG_ENABLE_JIT_DEEPGEMM: "true"; otherwise, consider removing or disabling the fast warmup flag.
Also applies to: 37-38
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml`:
- Around line 24-25: SGLANG_ENABLE_JIT_DEEPGEMM is currently "false" while
SGLANG_JIT_DEEPGEMM_FAST_WARMUP is "1", which is a no-op when JIT DeepGEMM is
disabled; fix by either setting SGLANG_ENABLE_JIT_DEEPGEMM to "true" to enable
JIT DeepGEMM (so SGLANG_JIT_DEEPGEMM_FAST_WARMUP takes effect) or remove/clear
SGLANG_JIT_DEEPGEMM_FAST_WARMUP so there is no misleading configuration; update
the values for SGLANG_ENABLE_JIT_DEEPGEMM and/or SGLANG_JIT_DEEPGEMM_FAST_WARMUP
to make them consistent.
- Around line 44-45: The decode environment has a configuration mismatch:
SGLANG_ENABLE_JIT_DEEPGEMM is set to "false" while
SGLANG_JIT_DEEPGEMM_FAST_WARMUP is "1"; update the decode env to be consistent
by either setting SGLANG_ENABLE_JIT_DEEPGEMM to "true" to enable JIT DeepGEMM
when keeping SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1", or disable/remove
SGLANG_JIT_DEEPGEMM_FAST_WARMUP if you intend to keep
SGLANG_ENABLE_JIT_DEEPGEMM: "false"; adjust the entries for
SGLANG_ENABLE_JIT_DEEPGEMM and SGLANG_JIT_DEEPGEMM_FAST_WARMUP so they reflect
the same intent.
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml`:
- Line 185: The concurrencies entry is using YAML list syntax (concurrencies:
[4096,7168,7680]) which is inconsistent with other recipes that use a delimited
string; update the concurrencies field to the same string format used elsewhere
(e.g., the delimited form used across recipes) or confirm the benchmark runner
accepts list syntax, then either convert the value for the concurrencies key to
the canonical delimited string or add validation to accept both formats in the
runner; locate and change the concurrencies entry in this recipe (symbol:
concurrencies) to match the project convention.
🧹 Nitpick comments (2)
recipes/gb300-fp8/1k1k/mtp/max.yaml (1)
157-159: Consider adding max-total-tokens for decode.
The prefill section specifies max-total-tokens: 524288, but the decode section only has mem-fraction-static: 0.75 without an explicit token limit. For consistency with other max-throughput recipes (e.g., gb200-fp8/1k1k/max-tpt-2p1d-mtp.yaml line 163), consider adding max-total-tokens to the decode config.
recipes/gb300-fp8/1k8k/mtp/max.yaml (1)
70-73: Confirm trust-remote-code: true is required and pin the model revision.
Line 72 and Line 130 enable remote code execution from the model repo. If this is required, please ensure the repo is vetted and pinned to a fixed revision or internal snapshot to avoid supply-chain drift.
Also applies to: 128-130
  SGLANG_ENABLE_JIT_DEEPGEMM: "false"
  SGLANG_ENABLE_FLASHINFER_GEMM: "1"
Inconsistent SGLANG_ENABLE_JIT_DEEPGEMM setting.
SGLANG_ENABLE_JIT_DEEPGEMM is set to "false" (line 24), but SGLANG_JIT_DEEPGEMM_FAST_WARMUP is set to "1" (line 38). The fast warmup flag has no effect when JIT DeepGEMM is disabled. Either enable JIT DeepGEMM or remove the fast warmup setting.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml` around lines 24 - 25,
SGLANG_ENABLE_JIT_DEEPGEMM is currently "false" while
SGLANG_JIT_DEEPGEMM_FAST_WARMUP is "1", which is a no-op when JIT DeepGEMM is
disabled; fix by either setting SGLANG_ENABLE_JIT_DEEPGEMM to "true" to enable
JIT DeepGEMM (so SGLANG_JIT_DEEPGEMM_FAST_WARMUP takes effect) or remove/clear
SGLANG_JIT_DEEPGEMM_FAST_WARMUP so there is no misleading configuration; update
the values for SGLANG_ENABLE_JIT_DEEPGEMM and/or SGLANG_JIT_DEEPGEMM_FAST_WARMUP
to make them consistent.
  SGLANG_ENABLE_JIT_DEEPGEMM: "false"
  SGLANG_ENABLE_FLASHINFER_GEMM: "1"
Same JIT DeepGEMM inconsistency in decode environment.
The decode environment has the same mismatch: SGLANG_ENABLE_JIT_DEEPGEMM: "false" (line 44) with SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" (line 59).
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low-latency-mtp.yaml` around lines 44 - 45, The decode
environment has a configuration mismatch: SGLANG_ENABLE_JIT_DEEPGEMM is set to
"false" while SGLANG_JIT_DEEPGEMM_FAST_WARMUP is "1"; update the decode env to
be consistent by either setting SGLANG_ENABLE_JIT_DEEPGEMM to "true" to enable
JIT DeepGEMM when keeping SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1", or
disable/remove SGLANG_JIT_DEEPGEMM_FAST_WARMUP if you intend to keep
SGLANG_ENABLE_JIT_DEEPGEMM: "false"; adjust the entries for
SGLANG_ENABLE_JIT_DEEPGEMM and SGLANG_JIT_DEEPGEMM_FAST_WARMUP so they reflect
the same intent.
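The contradiction flagged in both environments reduces to one boolean check. The truthiness rules below are an assumption for illustration, not SGLang's actual env parsing:

```python
# Hedged lint for the JIT DeepGEMM contradiction: warn when the fast-warmup
# flag is set while JIT DeepGEMM itself is disabled. How SGLang actually
# parses these strings is an assumption here.

def jit_deepgemm_conflict(env: dict) -> bool:
    enabled = env.get("SGLANG_ENABLE_JIT_DEEPGEMM", "").lower() in ("1", "true")
    warmup = env.get("SGLANG_JIT_DEEPGEMM_FAST_WARMUP", "") == "1"
    return warmup and not enabled  # fast warmup for a disabled feature

env = {"SGLANG_ENABLE_JIT_DEEPGEMM": "false",
       "SGLANG_JIT_DEEPGEMM_FAST_WARMUP": "1"}
print(jit_deepgemm_conflict(env))  # True
```

Applying the same check to both the prefill and decode environment blocks would have flagged lines 24/38 and 44/59 in one pass.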
  type: "sa-bench"
  isl: 1024
  osl: 1024
  concurrencies: [4096,7168,7680]
Inconsistent concurrencies format.
This file uses a YAML list [4096,7168,7680] while other recipes use a delimited string format like "4x8x32x64x80x96x112x128". Verify the benchmark runner accepts both formats, or align with the convention used elsewhere.
Example fix (if string format is required)
- concurrencies: [4096,7168,7680]
+ concurrencies: "4096x7168x7680"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- concurrencies: [4096,7168,7680]
+ concurrencies: "4096x7168x7680"
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml` at line 185, The concurrencies entry is
using YAML list syntax (concurrencies: [4096,7168,7680]) which is inconsistent
with other recipes that use a delimited string; update the concurrencies field
to the same string format used elsewhere (e.g., the delimited form used across
recipes) or confirm the benchmark runner accepts list syntax, then either
convert the value for the concurrencies key to the canonical delimited string or
add validation to accept both formats in the runner; locate and change the
concurrencies entry in this recipe (symbol: concurrencies) to match the project
convention.
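The "accept both formats" option the prompt mentions can be sketched as a permissive parser. This is not actual benchmark-runner code, just an illustration of the two shapes seen in the recipes:

```python
# Sketch of a permissive concurrencies parser accepting both formats seen in
# the recipes: a YAML list ([4096, 7168, 7680]) or an "x"-delimited string
# ("4x8x32..."). Not srt-slurm code.

def parse_concurrencies(value) -> list[int]:
    if isinstance(value, str):
        return [int(part) for part in value.split("x")]
    return [int(part) for part in value]

print(parse_concurrencies("4096x7168x7680"))    # [4096, 7168, 7680]
print(parse_concurrencies([4096, 7168, 7680]))  # [4096, 7168, 7680]
```

Normalizing at load time keeps individual recipes free to use either style.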
* Add GB300 MTP recipes
* gb200
* rest of configs
* fixes
* gb200: use fast warmup, no dg cache
* fix oom on 8k1k prefill. update low latency containers
* fixes
* Working 8k1k LL MTP. and mid
* add nginx
* update container for gb300 too
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml`:
- Around line 157-159: The decode section is missing the max-total-tokens
setting; add an explicit max-total-tokens: 524288 entry under the decode section
(matching the prefill section) to maintain consistency with other recipes and
avoid implicit limits—update the decode block near mem-fraction-static and
chunked-prefill-size so it includes max-total-tokens: 524288.
🧹 Nitpick comments (5)
recipes/gb300-fp8/1k1k/mtp/mid.yaml (3)
9-10: Consider pinning the nginx container version.
Using nginx without a version tag may lead to inconsistent deployments as the latest tag changes over time.
💡 Suggested fix
  frontend:
-   nginx_container: nginx
+   nginx_container: nginx:1.27
25-25: Hardcoded date in cache directory path may become stale.
SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" contains what appears to be a date (October 21, 2025). This could cause confusion as the configuration ages or if the cache directory structure changes.
Also applies to: 44-44
116-120: Comment says "MTP" but configuration uses EAGLE algorithm.
The comment on line 116 indicates "MTP" (Multi-Token Prediction), but speculative-algorithm is set to "EAGLE". These are different speculative decoding approaches. If this is intentional (using EAGLE for MTP-style speculation), consider clarifying the comment to avoid confusion.
💡 Suggested clarification
- # MTP
+ # Speculative decoding (EAGLE)
  speculative-algorithm: "EAGLE"
Or if MTP is the intended approach:
  # MTP
- speculative-algorithm: "EAGLE"
+ speculative-algorithm: "MTP"
recipes/gb300-fp8/1k1k/mtp/max.yaml (1)
26-26: Hardcoded date-based cache directory path.
SGLANG_DG_CACHE_DIR: "/configs/dg-10212025" uses a date-stamped path that may become stale or cause confusion. Consider parameterizing this or documenting the versioning strategy.
Also applies to: 45-45
recipes/gb300-fp8/8k1k/mtp/mid.yaml (1)
10-11: Pin the nginx image to a version or digest for reproducibility.
Unpinned tags can change behavior across deploys.
✅ Suggested adjustment (use your approved version/digest)
- nginx_container: nginx
+ nginx_container: "nginx:<pinned-version-or-digest>"
  # Memory and token limits
  mem-fraction-static: 0.75
  chunked-prefill-size: 36864
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check all max-total-tokens occurrences in gb300-fp8 configs
rg -n "max-total-tokens" recipes/gb300-fp8/ -A2 -B2 | head -100
Length of output: 6003
🏁 Script executed:
#!/bin/bash
# Also check the actual file to see the full context
echo "=== Full decode section context ==="
sed -n '123,179p' recipes/gb300-fp8/1k1k/mtp/max.yaml
Length of output: 1821
Decode section missing max-total-tokens setting.
The prefill section specifies max-total-tokens: 524288 but the decode section omits this setting. All other configuration files in the repository include this in their prefill sections. Add the explicit configuration for consistency, or document if intentionally omitted.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp8/1k1k/mtp/max.yaml` around lines 157 - 159, The decode
section is missing the max-total-tokens setting; add an explicit
max-total-tokens: 524288 entry under the decode section (matching the prefill
section) to maintain consistency with other recipes and avoid implicit
limits—update the decode block near mem-fraction-static and chunked-prefill-size
so it includes max-total-tokens: 524288.
Summary by CodeRabbit
New Features
Chores