📝 Walkthrough

Standardizes and renames multiple GB200 FP4/FP8 recipe YAMLs: bumps dynamo to 0.8.1, enables multi-frontends (adds `nginx_container` and `num_additional_frontends`), updates `model.container` to `lmsysorg/sglang:v0.5.8-cu130(-runtime)`, removes legacy SGLANG envs, and adds `fp*-gemm-backend` and `disaggregation-transfer-backend` settings; benchmark concurrencies are adjusted.

Sequence Diagram(s): omitted — changes are configuration updates not introducing new multi-component control flow warranting a sequence diagram.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 inconclusive)
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp4/1k1k/low-latency.yaml`:
- Around line 81-82: Replace the invalid config key fp4-gemm-backend with the
correct moe-runner-backend: keep disaggregation-transfer-backend: nixl as-is,
remove or rename the fp4-gemm-backend entry and add moe-runner-backend:
flashinfer_trtllm so the backend is configured using the supported key (update
the entry that currently reads fp4-gemm-backend to moe-runner-backend and retain
the flashinfer_trtllm value).
In `@recipes/gb200-fp4/1k8k/low-latency.yaml`:
- Around line 1-4: The dynamo version specified under the "dynamo" block
(dynamo.version: 0.8.1) is invalid on PyPI; update the version to a real release
(e.g., change dynamo.version to 0.8.0) in this recipe (name:
"gb200-fp4-1k8k-low-latency") and apply the same correction to other GB200
recipes that reuse the erroneous 0.8.1 value so the configuration matches an
actual PyPI release.
In `@recipes/gb200-fp8/1k1k/low-latency.yaml`:
- Line 46: The SGLANG_DG_CACHE_DIR value in the decode_environment entry
currently has a typo ("/configsdg-0.5.8_cu13"); update the environment variable
SGLANG_DG_CACHE_DIR to use the correct path "/configs/dg-0.5.8_cu13" so it
matches other recipe files and consistent cache directory naming.
- Line 29: The SGLANG_DG_CACHE_DIR value contains a typo: change the path in the
SGLANG_DG_CACHE_DIR entry from "/configsdg-0.5.8_cu13" to
"/configs/dg-0.5.8_cu13" so it matches the other recipes (e.g., mid-curve.yaml,
max-tpt.yaml) and restores the missing slash between "configs" and "dg".
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml`:
- Line 12: The inline comment for the YAML key num_additional_frontends is
truncated; update the comment associated with num_additional_frontends to a
complete sentence such as "# Additional routers (total = 1 +
num_additional_frontends)" or "# Additional routers (total routers = 1 +
num_additional_frontends)" so it clearly explains that the total router count
equals the base 1 plus the value of num_additional_frontends.
- Line 1: The configuration 'name' field is incorrect (it reads
"gb200-fp8-1k1k-max-tpt") and must match the benchmark settings (isl: 1024, osl:
8192); update the name value to "gb200-fp8-1k8k-max-tpt" in this YAML (modify
the name key at the top of the file) so the config name reflects the 1k8k setup.
- Around line 70-174: The prefill and decode blocks in sglang_config are missing
the fp8-gemm-backend setting; add fp8-gemm-backend: "flashinfer_trtllm" inside
both the prefill and decode sections (near other FP8/attention settings such as
kv-cache-dtype and attention-backend) so that the fp8 GEMM backend is configured
for functions that rely on fp8 GEMM behavior.
In `@recipes/gb200-fp8/8k1k/max_tpt.yaml`:
- Around line 114-115: The YAML contains duplicate keys named
"disaggregation-bootstrap-port" within the prefill and decode mappings causing
potential collisions; remove the redundant "disaggregation-bootstrap-port"
entries so each mapping has a single port key, keeping the intended value only
once (check the prefill and decode mappings where
"disaggregation-bootstrap-port" appears twice and delete the duplicate
occurrences), and verify there are no other duplicate keys like
"disaggregation-transfer-backend" in those mappings.
In `@recipes/gb200-fp8/8k1k/mid-curve.yaml`:
- Around line 113-114: There are duplicate disaggregation-bootstrap-port entries
in the YAML (appearing in both the prefill and decode sections); remove the
duplicate key so each section contains a single unique
disaggregation-bootstrap-port entry and ensure the corresponding
disaggregation-transfer-backend (e.g., nixl) remains paired with the intended
port; update the prefill/decode blocks (where keys repeat) to keep only one
consistent declaration of disaggregation-bootstrap-port and remove the redundant
one.
🧹 Nitpick comments (8)
recipes/gb200-fp4/1k8k/max-tpt.yaml (1)
26-62: Consider extracting common environment variables to reduce duplication.

The `prefill_environment` and `decode_environment` sections share many identical environment variables (lines 27-41 and 45-59). Consider extracting common variables to a shared section if the configuration format supports YAML anchors or templating, to reduce maintenance burden.

recipes/gb200-fp4/8k1k/mid-curve.yaml (1)
106-107: Minor: Consider quoting `nixl` for consistency.

The `disaggregation-transfer-backend: nixl` value is unquoted while other string values in this file are quoted (e.g., `"flashinfer_cutlass"`). While valid YAML, quoting would improve consistency.

Suggested fix:

```diff
   fp4-gemm-backend: "flashinfer_cutlass"
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

recipes/gb200-fp4/8k1k/low-latency.yaml (1)
82-83: Add a comment explaining why the low-latency profile uses `flashinfer_trtllm` instead of `flashinfer_cutlass`.

All low-latency recipes across all profiles (8k1k, 1k8k, 1k1k) use `flashinfer_trtllm`, while mid-curve and max-tpt consistently use `flashinfer_cutlass`. This intentional difference would benefit from a clarifying comment in the YAML to explain the architectural rationale.

recipes/gb200-fp4/1k8k/low-latency.yaml (2)
6-9: Minor: Trailing whitespace on lines 6-7.

Lines 6 and 7 appear to have trailing whitespace after `frontend:` and `type: dynamo`. While this won't affect YAML parsing, it's a minor style inconsistency. The reduction from 4 to 3 additional frontends should be validated against expected load for the 1k8k configuration.

🧹 Suggested cleanup:

```diff
-frontend: 
-  type: dynamo 
+frontend:
+  type: dynamo
```
82-83: LGTM on new backend configurations.

The addition of `fp4-gemm-backend: "flashinfer_trtllm"` and `disaggregation-transfer-backend: nixl` to both prefill and decode sections is consistent and aligns with the removal of `SGLANG_ENABLE_FLASHINFER_GEMM` from the environment variables. Explicit configuration is cleaner than environment variable flags.

One minor inconsistency: `nixl` is unquoted while other string values like `"flashinfer_trtllm"` are quoted. Consider quoting for consistency:

```diff
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

Also applies to: 106-107
recipes/gb200-fp8/1k8k/max_tpt.yaml (1)
16-18: Consider removing commented-out container lines.

Multiple commented-out container versions are left in the file. If these are no longer needed for reference, consider removing them to keep the configuration clean.

Proposed cleanup:

```diff
 model:
   path: "dsfp8"
-# container: "sglang0p5p5ppost2"
-# container: "sglang0p5p7"
-# container: "sglang0p5p8"
   container: "sglang0p5p8_cu13"
   precision: "fp8"
```

recipes/gb200-fp8/1k8k/mid_curve.yaml (1)
26-60: Optional: dedupe shared env vars if YAML merge keys are supported.

Both environments are nearly identical; anchors reduce drift across prefill/decode configs.

♻️ Possible YAML-anchor refactor:

```diff
-  prefill_environment:
+  prefill_environment: &common_environment
     TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
     SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
     DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    MC_TE_METRIC: "true" 
+    MC_TE_METRIC: "true"
     SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
     SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
     SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
     MC_FORCE_MNNVL: "1"
     NCCL_MNNVL_ENABLE: "1"
     NCCL_CUMEM_ENABLE: "1"
     SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
     SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
     PYTHONUNBUFFERED: "1"

   # Decode-specific environment variables
   decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    <<: *common_environment
     SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
-    MC_TE_METRIC: "true"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    PYTHONUNBUFFERED: "1"
```

recipes/gb200-fp8/1k8k/low_latency.yaml (1)
24-58: Consider YAML anchors to keep prefill/decode env in sync.

The two blocks are identical aside from `SGLANG_DECODE_BOOTSTRAP_TIMEOUT`; anchoring reduces drift.

♻️ Suggested refactor:

```diff
 backend:
-  prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-
-  decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
+  prefill_environment: &common_environment
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+
+  decode_environment:
+    <<: *common_environment
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
```
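Whether the suggested anchor/merge-key refactor is safe depends on the recipe loader using a YAML 1.1 parser with merge-key support. A minimal sketch, assuming the recipes are parsed with PyYAML (whose `safe_load` resolves `<<` merges):

```python
import yaml

# Sketch of the merge-key pattern suggested above.
# Assumption: the recipe loader uses a YAML 1.1 parser with
# merge-key support, as PyYAML's safe_load does.
doc = """
prefill_environment: &common_environment
  TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
  PYTHONUNBUFFERED: "1"
decode_environment:
  <<: *common_environment
  SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
"""

cfg = yaml.safe_load(doc)
decode = cfg["decode_environment"]
# The decode block inherits the shared vars and keeps its own key.
assert decode["TORCH_DISTRIBUTED_DEFAULT_TIMEOUT"] == "1800"
assert decode["SGLANG_DECODE_BOOTSTRAP_TIMEOUT"] == "1000"
```

If the loader is built on a strict YAML 1.2 parser (e.g., ruamel's pure 1.2 mode), merge keys are not guaranteed, so this should be verified against the actual srt-slurm loader before adopting the refactor.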
```yaml
        disaggregation-transfer-backend: nixl
        fp4-gemm-backend: "flashinfer_trtllm"
```
🧩 Analysis chain
🌐 Web query:
sglang v0.5.8 flashinfer_trtllm backend configuration
💡 Result:
In SGLang v0.5.8, flashinfer_trtllm is configured as the MoE runner backend via --moe-runner-backend (it’s not the attention backend flag). The valid options list includes flashinfer_trtllm. [1]
CLI example (as shown in the SGLang docs)
```shell
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-FP4 \
  --tp 8 \
  --attention-backend trtllm_mla \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --prefill-attention-backend fa4
```

This is an example of "hybrid attention" (different prefill vs decode backends) while using flashinfer_trtllm for the MoE runner. [2]
YAML config equivalent:

```yaml
model-path: nvidia/DeepSeek-R1-FP4
tp: 8
attention-backend: trtllm_mla
moe-runner-backend: flashinfer_trtllm
quantization: modelopt_fp4
prefill-attention-backend: fa4
```

Launch with:

```shell
python3 -m sglang.launch_server --config config.yaml
```

(Config-file support is documented; CLI args override YAML.) [1]
Sources: Server arguments (MoE + config support) [1]; Attention backend doc showing flashinfer_trtllm usage in an example command [2].
🌐 Web query:
sglang v0.5.8 disaggregation-transfer-backend nixl options
💡 Result:
In SGLang v0.5.8, the NIXL “options” for PD disaggregation are mainly (a) selecting NIXL as the transfer backend, plus (b) the general PD-disaggregation knobs (there aren’t extra SGLang CLI flags that are NIXL-specific in the docs).
1) Enable NIXL transfer backend (PD disaggregation)
Use:

```shell
--disaggregation-transfer-backend nixl
```

(supported values: `mooncake`, `nixl`, `ascend`, `fake`) [1]

and run separate servers with `--disaggregation-mode prefill` / `--disaggregation-mode decode`. [1][2]

The docs example (single node) uses exactly these flags with NIXL. [2]
2) PD-disaggregation knobs you can still use with NIXL
These are the PD-disaggregation CLI arguments listed by SGLang (apply regardless of backend selection): [1]
- `--disaggregation-bootstrap-port` (prefill bootstrap server port)
- `--disaggregation-decode-tp` (set on prefill server; decode TP size)
- `--disaggregation-decode-dp` (set on prefill server; decode DP size)
- `--disaggregation-prefill-pp` (set on decode server; prefill PP size)
- `--disaggregation-decode-enable-offload-kvcache`
- `--disaggregation-decode-enable-fake-auto`
- `--num-reserved-decode-tokens`
- `--disaggregation-decode-polling-interval`
- `--disaggregation-ib-device` exists, but its description is tied to auto-detection for the mooncake backend (not described as a NIXL control). [1]
3) Install/bring NIXL (dependency)
SGLang’s PD-disaggregation doc says to install NIXL via pip (or build from source, e.g., if you already have UCX installed). [2]
Sources:
[1] SGLang “Server Arguments” (PD disaggregation section). (docs.sglang.io)
[2] SGLang “PD Disaggregation” (NIXL requirements + usage examples). (docs.sglang.io)
Citations:
- 1: https://docs.sglang.io/advanced_features/server_arguments.html?utm_source=openai
- 2: https://docs.sglang.io/advanced_features/pd_disaggregation.html?utm_source=openai
🌐 Web query:
sglang 0.5.8 release notes backend options
💡 Result:
SGLang v0.5.8 was released Jan 23, 2026 (PyPI upload date). [1]
v0.5.8 release notes (backend-related items)
- SGLang-Diffusion: can run sglang-diffusion with the `diffusers` backend (plus multi-LoRA inference, "SLA attention backends", CLI warmup switch, ComfyUI plugin, and broad perf improvements). [2]
- Flash Attention 4: adds Flash Attention 4 decoding kernels. [2]
“Backend options” you can configure (runtime/kernel backends)
From SGLang server CLI args (see `python -m sglang.launch_server --help`), the main backend-selector knobs are: [3]

- `--model-impl`: `auto` / `sglang` / `transformers` (model implementation choice). [3]
- `--attention-backend`: attention kernel backend. [3]
- `prefill_attention_backend` and `decode_attention_backend` (experimental; override the attention backend separately for prefill vs decode). [3]
- `--sampling-backend`: sampling kernel backend. [3]
- `--grammar-backend`: backend for grammar-guided decoding. [3]
- `--mm-attention-backend`: multimodal attention backend. [3]
Sources: PyPI sglang 0.5.8 release entry [1]; GitHub release notes for v0.5.8 [2]; SGLang “Server Arguments” docs (kernel backend flags) [3].
Use `moe-runner-backend: flashinfer_trtllm` instead of `fp4-gemm-backend: flashinfer_trtllm`.

In SGLang v0.5.8, `flashinfer_trtllm` is a valid backend value but must be configured via the `moe-runner-backend` option, not `fp4-gemm-backend`. The config key `fp4-gemm-backend` does not exist in the official SGLang configuration schema. The `disaggregation-transfer-backend: nixl` setting is valid and correct.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp4/1k1k/low-latency.yaml` around lines 81 - 82, Replace the
invalid config key fp4-gemm-backend with the correct moe-runner-backend: keep
disaggregation-transfer-backend: nixl as-is, remove or rename the
fp4-gemm-backend entry and add moe-runner-backend: flashinfer_trtllm so the
backend is configured using the supported key (update the entry that currently
reads fp4-gemm-backend to moe-runner-backend and retain the flashinfer_trtllm
value).
```diff
 name: "gb200-fp4-1k8k-low-latency"

 dynamo:
-  version: 0.7.0
+  version: 0.8.1
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check if dynamo 0.8.1 exists on PyPI
curl -s "https://pypi.org/pypi/dynamo/0.8.1/json" | head -c 200
```

Repository: ishandhanani/srt-slurm
Length of output: 90

🏁 Script executed:

```shell
# Search for other dynamo version references in the repository
rg "dynamo:" --type yaml -A 1 | head -50
```

Length of output: 3082

🏁 Script executed:

```shell
cat recipes/gb200-fp4/1k8k/low-latency.yaml | head -20
```

Length of output: 405

🏁 Script executed:

```shell
# Search for all dynamo version specifications with "version:" pattern
rg "dynamo:" -A 2 -B 0 | rg "version:|hash:|install:" | sort -u
```

Length of output: 28548
Dynamo version 0.8.1 does not exist on PyPI and must be corrected.
While the experiment name follows the standardized convention, the dynamo configuration is invalid. The version 0.8.1 is not a valid release on PyPI (PyPI returned "Not Found"). Although this version is consistently used across multiple GB200 recipes in this PR, the underlying version must be corrected to an actual release (e.g., 0.8.0 as referenced in docs/config-reference.md) before these configurations can be used.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp4/1k8k/low-latency.yaml` around lines 1 - 4, The dynamo
version specified under the "dynamo" block (dynamo.version: 0.8.1) is invalid on
PyPI; update the version to a real release (e.g., change dynamo.version to
0.8.0) in this recipe (name: "gb200-fp4-1k8k-low-latency") and apply the same
correction to other GB200 recipes that reuse the erroneous 0.8.1 value so the
configuration matches an actual PyPI release.
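A pre-flight check in the recipe tooling could catch this class of error before jobs are rendered, by validating the pinned version against an allow-list of known releases. A sketch; the helper and allow-list are hypothetical (not part of srt-slurm), and the list would need to be maintained by hand or fetched from PyPI:

```python
# Hypothetical recipe pre-flight check: flag dynamo versions that are
# not known PyPI releases. KNOWN_DYNAMO_RELEASES is an assumption, not
# an exhaustive list of real releases.
KNOWN_DYNAMO_RELEASES = {"0.7.0", "0.8.0"}

def check_dynamo_version(recipe: dict) -> list[str]:
    """Return a list of errors for an already-parsed recipe document."""
    version = str(recipe.get("dynamo", {}).get("version", ""))
    if version and version not in KNOWN_DYNAMO_RELEASES:
        return [f"dynamo version {version!r} is not a known PyPI release"]
    return []

errors = check_dynamo_version(
    {"name": "gb200-fp4-1k8k-low-latency", "dynamo": {"version": "0.8.1"}}
)
print(errors)  # → ["dynamo version '0.8.1' is not a known PyPI release"]
```

Run against all recipe files in CI, this would have flagged every copy of the erroneous `0.8.1` pin at once.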
```diff
   PYTHONUNBUFFERED: "1"
   DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-  SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
+  SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"
```
Likely typo in cache directory path.
The path /configsdg-0.5.8_cu13 appears to be missing a slash between configs and dg. Other files in this PR (mid-curve.yaml, max-tpt.yaml) use /configs/dg-0.5.8_cu13.
Proposed fix:

```diff
-  SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"
+  SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/low-latency.yaml` at line 29, The SGLANG_DG_CACHE_DIR
value contains a typo: change the path in the SGLANG_DG_CACHE_DIR entry from
"/configsdg-0.5.8_cu13" to "/configs/dg-0.5.8_cu13" so it matches the other
recipes (e.g., mid-curve.yaml, max-tpt.yaml) and restores the missing slash
between "configs" and "dg".
```diff
   PYTHONUNBUFFERED: "1"
   DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-  SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
+  SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"
```
Same cache directory typo in decode_environment.
This path should also be /configs/dg-0.5.8_cu13 for consistency with other recipe files.
Proposed fix:

```diff
-  SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"
+  SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
```

📝 Committable suggestion:

```yaml
SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/low-latency.yaml` at line 46, The SGLANG_DG_CACHE_DIR
value in the decode_environment entry currently has a typo
("/configsdg-0.5.8_cu13"); update the environment variable SGLANG_DG_CACHE_DIR
to use the correct path "/configs/dg-0.5.8_cu13" so it matches other recipe
files and consistent cache directory naming.
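This typo is also easy to catch mechanically: a small lint pass over an already-parsed recipe can flag any `SGLANG_DG_CACHE_DIR` value that lost the slash after `configs`. A sketch; the helper name and the assumption that all DG caches live under `/configs/` are illustrative, not part of the repo:

```python
# Hypothetical lint pass: walk an already-parsed recipe document and
# flag SGLANG_DG_CACHE_DIR values that do not start with "/configs/".
# The "/configs/" prefix is an assumption based on the other recipes.
def find_bad_cache_dirs(node, found=None):
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "SGLANG_DG_CACHE_DIR" and not str(value).startswith("/configs/"):
                found.append(str(value))
            find_bad_cache_dirs(value, found)
    elif isinstance(node, list):
        for item in node:
            find_bad_cache_dirs(item, found)
    return found

recipe = {
    "backend": {
        "decode_environment": {"SGLANG_DG_CACHE_DIR": "/configsdg-0.5.8_cu13"}
    }
}
print(find_bad_cache_dirs(recipe))  # → ['/configsdg-0.5.8_cu13']
```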
```diff
@@ -0,0 +1,182 @@
+name: "gb200-fp8-1k1k-max-tpt"
```
Configuration name does not match the file path.
The file is located at recipes/gb200-fp8/1k8k/max_tpt.yaml but the name is set to gb200-fp8-1k1k-max-tpt. The benchmark section confirms this is a 1k8k configuration (isl: 1024, osl: 8192), so the name should be corrected.
Proposed fix:

```diff
-name: "gb200-fp8-1k1k-max-tpt"
+name: "gb200-fp8-1k8k-max-tpt"
```

📝 Committable suggestion:

```yaml
name: "gb200-fp8-1k8k-max-tpt"
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml` at line 1, The configuration 'name'
field is incorrect (it reads "gb200-fp8-1k1k-max-tpt") and must match the
benchmark settings (isl: 1024, osl: 8192); update the name value to
"gb200-fp8-1k8k-max-tpt" in this YAML (modify the name key at the top of the
file) so the config name reflects the 1k8k setup.
```yaml
frontend:
  type: dynamo
  enable_multiple_frontends: true  # Enable nginx + multiple routers
  num_additional_frontends: 9  # Additional routers (total = 1 + t
```
Truncated comment.
The comment appears to be cut off mid-sentence: "# Additional routers (total = 1 + t". Please complete the comment.
Proposed fix:

```diff
-  num_additional_frontends: 9  # Additional routers (total = 1 + t
+  num_additional_frontends: 9  # Additional routers (total = 1 + this value = 10)
```

📝 Committable suggestion:

```yaml
num_additional_frontends: 9  # Additional routers (total = 1 + this value = 10)
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml` at line 12, The inline comment for the
YAML key num_additional_frontends is truncated; update the comment associated
with num_additional_frontends to a complete sentence such as "# Additional
routers (total = 1 + num_additional_frontends)" or "# Additional routers (total
routers = 1 + num_additional_frontends)" so it clearly explains that the total
router count equals the base 1 plus the value of num_additional_frontends.
```yaml
sglang_config:
  prefill:
    # Model configuration
    served-model-name: "deepseek-ai/DeepSeek-R1"
    skip-tokenizer-init: true
    trust-remote-code: true

    # Parallelism
    tp-size: 8
    dp-size: 8
    ep-size: 8
    enable-dp-attention: true

    # KV cache and attention
    attention-backend: "trtllm_mla"
    kv-cache-dtype: "fp8_e4m3"

    # Radix cache disabled
    disable-radix-cache: true

    # Other flags
    stream-interval: 50
    max-running-requests: 30000
    context-length: 10000
    watchdog-timeout: 1000000
    disable-shared-experts-fusion: true
    eplb-algorithm: "deepseek"
    disaggregation-bootstrap-port: 30001

    # Prefill-specific mode
    disaggregation-mode: "prefill"

    # Memory and token limits
    mem-fraction-static: 0.75
    max-total-tokens: 524288
    chunked-prefill-size: 131072

    # Request handling
    load-balance-method: "round_robin"

    # Performance optimizations
    disable-cuda-graph: true

    # DeepEP configuration
    moe-a2a-backend: "deepep"
    deepep-mode: "normal"
    ep-dispatch-algorithm: "dynamic"
    moe-dense-tp-size: 1
    enable-dp-lm-head: true
    ep-num-redundant-experts: 32
    deepep-config: "/configs/deepep_config.json"

    disaggregation-transfer-backend: nixl

  decode:
    # Model configuration
    served-model-name: "deepseek-ai/DeepSeek-R1"
    skip-tokenizer-init: true
    trust-remote-code: true

    # Parallelism
    tp-size: 32
    dp-size: 32
    ep-size: 32
    enable-dp-attention: true

    # KV cache and attention
    attention-backend: "trtllm_mla"
    kv-cache-dtype: "fp8_e4m3"

    # Radix cache disabled
    disable-radix-cache: true

    # Other flags
    stream-interval: 50
    decode-log-interval: 1000
    max-running-requests: 45000
    context-length: 10000
    watchdog-timeout: 1000000
    disable-shared-experts-fusion: true
    eplb-algorithm: "deepseek"
    disaggregation-bootstrap-port: 30001

    # Decode-specific mode
    disaggregation-mode: "decode"

    # Memory and token limits
    mem-fraction-static: 0.82
    chunked-prefill-size: 36864

    # DeepEP configuration
    moe-a2a-backend: "deepep"
    deepep-mode: "low_latency"
    ep-dispatch-algorithm: "static"
    moe-dense-tp-size: 1
    enable-dp-lm-head: true
    prefill-round-robin-balance: true
    ep-num-redundant-experts: 32
    deepep-config: "/configs/deepep_config.json"

    # CUDA graphs
    cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768]
    cuda-graph-max-bs: 768

    disaggregation-transfer-backend: nixl
```
🧩 Analysis chain
🏁 Script executed:

```shell
rg -n "fp8-gemm-backend" recipes/
```

Repository: ishandhanani/srt-slurm
Length of output: 1546

🏁 Script executed:

```shell
sed -n '70,120p' recipes/gb200-fp8/1k8k/low_latency.yaml
```

Length of output: 1666
Add missing fp8-gemm-backend configuration to both prefill and decode sections.
The fp8-gemm-backend: "flashinfer_trtllm" configuration is missing from this file. Similar GB200 FP8 recipes (e.g., gb200-fp8/1k8k/low_latency.yaml) include this setting in both prefill and decode sections.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml` around lines 70 - 174, The prefill and
decode blocks in sglang_config are missing the fp8-gemm-backend setting; add
fp8-gemm-backend: "flashinfer_trtllm" inside both the prefill and decode
sections (near other FP8/attention settings such as kv-cache-dtype and
attention-backend) so that the fp8 GEMM backend is configured for functions that
rely on fp8 GEMM behavior.
```yaml
    disaggregation-bootstrap-port: 30001
    disaggregation-transfer-backend: nixl
```
Deduplicate disaggregation-bootstrap-port to avoid YAML key collisions.
The key appears twice in both prefill and decode mappings; YAML parsers may error or last-one-wins, risking unintended config.
🧹 Proposed fix (remove the duplicate entries):

```diff
@@
-    disaggregation-bootstrap-port: 30001
     disaggregation-transfer-backend: nixl
@@
-    disaggregation-bootstrap-port: 30001
     disaggregation-transfer-backend: nixl
```

Also applies to: 166-167
🧰 Tools
🪛 YAMLlint (1.38.0)
[error] 114-114: duplication of key "disaggregation-bootstrap-port" in mapping
(key-duplicates)
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/8k1k/max_tpt.yaml` around lines 114 - 115, The YAML
contains duplicate keys named "disaggregation-bootstrap-port" within the prefill
and decode mappings causing potential collisions; remove the redundant
"disaggregation-bootstrap-port" entries so each mapping has a single port key,
keeping the intended value only once (check the prefill and decode mappings
where "disaggregation-bootstrap-port" appears twice and delete the duplicate
occurrences), and verify there are no other duplicate keys like
"disaggregation-transfer-backend" in those mappings.
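Duplicate keys like this are silently last-one-wins under PyYAML's default loaders, which is why the recipe can look valid while dropping a setting. A strict loader can surface them at load time; a sketch, assuming the recipes are parsed with PyYAML (the loader subclass is illustrative, not part of srt-slurm):

```python
import yaml

class DuplicateKeyError(ValueError):
    """Raised when a YAML mapping repeats a key."""

class StrictLoader(yaml.SafeLoader):
    """SafeLoader variant that rejects duplicate mapping keys."""
    def construct_mapping(self, node, deep=False):
        seen = set()
        for key_node, _value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in seen:
                raise DuplicateKeyError(
                    f"duplicate key {key!r} near line {key_node.start_mark.line + 1}"
                )
            seen.add(key)
        return super().construct_mapping(node, deep)

doc = """
prefill:
  disaggregation-bootstrap-port: 30001
  disaggregation-bootstrap-port: 30001
"""

try:
    yaml.load(doc, Loader=StrictLoader)
except DuplicateKeyError as exc:
    print(exc)  # reports the duplicated key and its line
```

This is essentially what the YAMLlint `key-duplicates` rule flagged above, but enforced at parse time instead of in a separate lint step.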
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp4/1k8k/max-tpt.yaml`:
- Around line 61-63: Remove the legacy env var
SGLANG_FLASHINFER_FP4_GEMM_BACKEND from decode_environment to avoid
conflicting/ambiguous configuration with the newer sglang_config key
fp4-gemm-backend; update the decode_environment block (where
SGLANG_FLASHINFER_FP4_GEMM_BACKEND is set) so that only fp4-gemm-backend in
sglang_config controls the GEMM backend, and ensure prefill_environment and
decode_environment are symmetric (either both omit the legacy env var or both
include it only if you intend to keep backward compatibility).
In `@recipes/gb200-fp8/1k1k/max-tpt.yaml`:
- Around line 30-31: Replace all occurrences of the old cache directory string
"/configs/dg-10212025" with the new "/configs/dg-0.5.8_cu13" in every recipe and
example that references it, and fix the typo "/configsdg-0.5.8_cu13" to
"/configs/dg-0.5.8_cu13"; specifically search for environment variable
assignments like SGLANG_DG_CACHE_DIR and any YAML values that contain the cache
path and update them to the correct value to ensure consistency across recipes
(including the gb300/h200/h100 variants and examples).
In `@recipes/gb200-fp8/1k8k/low_latency.yaml`:
- Around line 72-76: The benchmark requests require 1024 (isl) + 8192 (osl) =
9216 tokens but max-total-tokens is set to 8192; update the max-total-tokens
field to at least 9216 (or up to 10000 to match context-length) so caching can
accommodate the full request—modify the max-total-tokens value in the same YAML
block containing context-length, disaggregation-mode, mem-fraction-static, and
chunked-prefill-size.
- Around line 12-15: The container image tag "v0.5.8-cu130-runtime" under the
model block (container: "lmsysorg/sglang:v0.5.8-cu130-runtime") is invalid and
will cause image-pull failures; update the container value to a valid tag such
as "lmsysorg/sglang:v0.5.7-cu130-runtime" or the stable
"lmsysorg/sglang:latest-cu130-runtime" while keeping the model.path ("dsr1-fp8")
and precision ("fp8") entries unchanged.
🧹 Nitpick comments (9)
recipes/gb200-fp8/1k1k/low-latency.yaml (2)
88-89: Minor formatting and consistency issues.

- Line 88 appears to be an unnecessary blank line within the YAML mapping.
- The value `nixl` is unquoted, while other similar string values like `"flashinfer_trtllm"` are quoted.

For consistency, consider quoting `nixl` and removing the blank line.

Proposed fix:

```diff
   moe-dense-tp-size: 1
   tensor-parallel-size: 4
   data-parallel-size: 1
   expert-parallel-size: 1
-
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```
114-116: Same formatting inconsistency as prefill section.

The value `nixl` is unquoted and there's a blank line (115) within the mapping. For consistency with other string values, consider quoting `nixl` and removing the blank line.

Proposed fix:

```diff
   tensor-parallel-size: 4
   data-parallel-size: 1
   expert-parallel-size: 1
   fp8-gemm-backend: "flashinfer_trtllm"
-
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

recipes/gb200-fp8/8k1k/low-latency.yaml (1)
84-86: Minor: `nixl` value is unquoted unlike other string values.

For consistency with other string values in this file (e.g., `"flashinfer_trtllm"`), consider quoting `nixl`:

💅 Suggested style fix

```diff
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

Apply to both prefill (line 86) and decode (line 112) sections.

Also applies to: 110-112
recipes/gb200-fp4/8k1k/max-tpt.yaml (2)
107-108: Consider quoting `nixl` for consistency.

The value `nixl` on line 108 is unquoted while `"flashinfer_cutlass"` on line 107 is quoted. For consistency with other string values in this file, consider quoting it.

✏️ Suggested fix

```diff
   fp4-gemm-backend: "flashinfer_cutlass"
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```
166-167: Same quoting inconsistency as prefill section.

Apply the same fix here for consistency.

✏️ Suggested fix

```diff
   fp4-gemm-backend: "flashinfer_cutlass"
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

recipes/gb200-fp4/1k8k/mid-curve.yaml (1)
73-73: LGTM with minor note.

The new `disaggregation-transfer-backend` and `fp4-gemm-backend` settings are correctly configured for prefill. Minor style observation: `nixl` is unquoted while `"flashinfer_cutlass"` is quoted; consider consistent quoting for string values.

Also applies to: 109-109
recipes/gb200-fp4/1k1k/max-tpt.yaml (1)
6-10: Pin the nginx image tag to avoid drift.

Using the floating `nginx` tag can introduce unexpected behavior on upgrades.

♻️ Proposed change

```diff
-  nginx_container: nginx
+  nginx_container: "nginx:<pinned-version-or-digest>"
```

recipes/gb200-fp8/1k8k/mid_curve.yaml (1)
63-166: Ensure DG precompile runs mirror these runtime flags.

`SGLANG_DG_CACHE_DIR` implies DeepGEMM precompile is used; make sure any precompile job uses the same flags as this `sglang_config` (e.g., `served-model-name`, parallelism sizes, attention backend, etc.) to avoid stale/incorrect kernels.

Based on learnings: In the SGLang backend (src/srtctl/backends/sglang.py), the sglang.deep_gemm_precompile module requires all the same flags as the actual serving run (including --served-model-name, --model-path, etc.) because it needs to mirror the runtime configuration to precompile the correct kernels.
recipes/gb200-fp8/1k8k/low_latency.yaml (1)
6-11: Pin the nginx image tag for reproducibility.

`nginx` without a version tag is a floating reference that resolves to the latest image, potentially introducing unreviewed changes on redeploy. Other container images in this file (e.g., `lmsysorg/sglang:v0.5.8-cu130-runtime`) are pinned with specific versions. Note that some recipes in the codebase use `nginx-sqsh` as a pinned alternative; consider whether that or a specific nginx version tag is appropriate here.

♻️ Suggested change

```diff
-  nginx_container: nginx
+  nginx_container: "nginx:<pinned_version_or_digest>"
```
```yaml
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
SGLANG_MOE_NVFP4_DISPATCH: "1"
SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions
SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass"
```
Potential redundancy: SGLANG_FLASHINFER_FP4_GEMM_BACKEND env var vs fp4-gemm-backend config.
The decode_environment still contains SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" (line 63), while the new fp4-gemm-backend: "flashinfer_cutlass" is added in sglang_config for both prefill (line 110) and decode (line 237).
This creates:
- Potential redundancy between env var and config option
- Asymmetry: prefill_environment lacks this env var, but decode has it
If fp4-gemm-backend in sglang_config is the new preferred approach, consider removing the legacy env var from decode_environment to avoid confusion about which setting takes precedence.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp4/1k8k/max-tpt.yaml` around lines 61 - 63, Remove the legacy
env var SGLANG_FLASHINFER_FP4_GEMM_BACKEND from decode_environment to avoid
conflicting/ambiguous configuration with the newer sglang_config key
fp4-gemm-backend; update the decode_environment block (where
SGLANG_FLASHINFER_FP4_GEMM_BACKEND is set) so that only fp4-gemm-backend in
sglang_config controls the GEMM backend, and ensure prefill_environment and
decode_environment are symmetric (either both omit the legacy env var or both
include it only if you intend to keep backward compatibility).
```yaml
SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Get the full context of the YAML file
cat -n recipes/gb200-fp8/1k1k/max-tpt.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 6143

🏁 Script executed:

```shell
#!/bin/bash
# Find any remaining references to the old DG cache dir
rg -n '/configs/dg-10212025|/configs/dg-0.5.8_cu13'
```

Repository: ishandhanani/srt-slurm
Length of output: 4645

🏁 Script executed:

```shell
#!/bin/bash
# Check for precompile configurations in the repo
rg -n 'precompile|SGLANG_DG' --type yaml -A 2 -B 2
```

Repository: ishandhanani/srt-slurm
Length of output: 21853

🏁 Script executed:

```shell
#!/bin/bash
# Look for sglang backend files to understand precompile requirements
find . -path '*/srtctl/backends/sglang.py' -o -path '*/srtctl/backends/*sglang*'
```

Repository: ishandhanani/srt-slurm
Length of output: 97

🏁 Script executed:

```shell
#!/bin/bash
# Check the sglang backend file for precompile handling
head -100 src/srtctl/backends/sglang.py
```

Repository: ishandhanani/srt-slurm
Length of output: 3239

🏁 Script executed:

```shell
#!/bin/bash
# Search for precompile-related code in the sglang backend
rg -n 'precompile|deep_gemm' src/srtctl/backends/sglang.py -A 3 -B 1
```

Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:

```shell
#!/bin/bash
# Look for any precompile scripts or separate precompile configurations
find . -name '*precompile*' -o -name '*deepgemm*' 2>/dev/null | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 48
Update cache directory references across all affected recipes to ensure consistency.
Old cache directory `/configs/dg-10212025` is still referenced in multiple recipes:

- recipes/gb300-fp8/ (8k1k and 1k1k variants, mid/max/low-latency)
- recipes/h200/ and recipes/h100/ (multiple configurations)
- examples/enable-profiling.yaml

Also fix the typo in recipes/gb200-fp8/1k1k/low-latency.yaml (lines 30, 47): `/configsdg-0.5.8_cu13` should be `/configs/dg-0.5.8_cu13` (missing slash).
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/max-tpt.yaml` around lines 30 - 31, Replace all
occurrences of the old cache directory string "/configs/dg-10212025" with the
new "/configs/dg-0.5.8_cu13" in every recipe and example that references it, and
fix the typo "/configsdg-0.5.8_cu13" to "/configs/dg-0.5.8_cu13"; specifically
search for environment variable assignments like SGLANG_DG_CACHE_DIR and any
YAML values that contain the cache path and update them to the correct value to
ensure consistency across recipes (including the gb300/h200/h100 variants and
examples).
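The sweep described in this prompt can be mechanized. The sketch below is a hypothetical helper (not part of srtctl) that scans recipe text for both the stale cache dir and the missing-slash typo called out above:

```python
import re

# Known-bad cache dir spellings flagged in this review; the second pattern is
# the missing-slash typo from gb200-fp8/1k1k/low-latency.yaml.
BAD_PATTERNS = [r"/configs/dg-10212025", r"/configsdg-0\.5\.8_cu13"]

def find_bad_cache_dirs(yaml_text):
    """Return every offending cache-dir substring found in a recipe's text."""
    hits = []
    for pat in BAD_PATTERNS:
        hits.extend(re.findall(pat, yaml_text))
    return hits

print(find_bad_cache_dirs('SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"'))
# → ['/configsdg-0.5.8_cu13']
print(find_bad_cache_dirs('SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"'))
# → []
```

Running something like this over `recipes/**/*.yaml` and `examples/*.yaml` would confirm no stale references remain after the rename.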
```yaml
model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"
```
🧩 Analysis chain
🌐 Web query:
lmsysorg/sglang v0.5.8 cu130 runtime image release notes
💡 Result:
I couldn’t find any published release notes (or even an existing tag) for lmsysorg/sglang:v0.5.8-cu130-runtime on the project’s official release channels.
What I can verify as of Feb 3, 2026:
- The latest SGLang GitHub release visible is `v0.5.7` (dated Jan 1); there is no `v0.5.8` release entry on the releases page. [1]
- The official install docs recommend the CUDA 13 runtime line via `latest-cu130-runtime` for CUDA 13 environments, but do not mention `v0.5.8-cu130-runtime`. [2]
If you meant “release notes for the CUDA 13.0 runtime image around this timeframe,” the closest matching release notes are the SGLang v0.5.7 notes on GitHub. [1]
Sources:
[1] sgl-project/sglang GitHub Releases page (shows v0.5.7 and no v0.5.8) (github.com)
[2] SGLang install docs (Docker notes; recommends latest-cu130-runtime) (docs.sglang.io)
Citations:
- 1: https://github.com/sgl-project/sglang/releases?utm_source=openai
- 2: https://docs.sglang.io/get_started/install.html?utm_source=openai
🌐 Web query:
SGLang GB200 CUDA 13.0 support compatibility
💡 Result:
- Yes: SGLang can run on NVIDIA GB200 with CUDA 13.0, but the easiest/most reliable path is to use a CUDA 13-based SGLang Docker/container build rather than trying to compile everything yourself. [1], [2]
- NVIDIA's SGLang container "Release 25.11" is explicitly CUDA 13.0.2-based and lists CUDA 13.0 compatibility plus Blackwell-era support items (e.g., B300/GB300, Blackwell RTX Pro server). [1]
- Driver requirement (CUDA 13.0.2 base): NVIDIA states this CUDA 13.0.2-based release requires NVIDIA driver 570+, with notes that some datacenter GPUs can use specific older driver branches in forward-compatibility mode. [1]
- CUDA 13.0 itself explicitly supports Blackwell B200/GB200 (and B300/GB300). [6]
- If you're installing outside Docker, SGLang's install docs note that for CUDA 13, Docker is recommended; alternatively you can install a matching `sgl_kernel` "+cu130" wheel. [2]
- SGLang platform notes also list GB200 among supported NVIDIA targets (with CUDA generally). [4]
Sources: NVIDIA SGLang container release notes (CUDA 13.0.2 + driver guidance) [1], SGLang install docs (CUDA 13 install guidance / cu130 wheels) [2], NVIDIA CUDA 13.0 blog (GB200 supported) [6], SGLang platform/support summary listing GB200 [4].
🌐 Web query:
sglang:v0.5.8-cu130-runtime docker image
💡 Result:
The official SGLang Docker images are published on Docker Hub under lmsysorg/sglang. For CUDA 13.0, the docs recommend the stable runtime tag `lmsysorg/sglang:latest-cu130-runtime` (or nightly `lmsysorg/sglang:dev-cu13`). [1]

I could not find a public listing/reference for an exact `v0.5.8-cu130-runtime` tag, but a closely related published tag is:

```shell
docker pull lmsysorg/sglang:v0.5.7-cu130-runtime
```

[2]
Update the SGLang image tag—v0.5.8 does not exist.
The tag v0.5.8-cu130-runtime is not available. The latest SGLang release is v0.5.7 (Jan 1, 2026). Use v0.5.7-cu130-runtime instead, or the recommended stable tag latest-cu130-runtime for CUDA 13.0. CUDA 13.0 and GB200 are supported, but the nonexistent version will cause image pull failures.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low_latency.yaml` around lines 12 - 15, The container
image tag "v0.5.8-cu130-runtime" under the model block (container:
"lmsysorg/sglang:v0.5.8-cu130-runtime") is invalid and will cause image-pull
failures; update the container value to a valid tag such as
"lmsysorg/sglang:v0.5.7-cu130-runtime" or the stable
"lmsysorg/sglang:latest-cu130-runtime" while keeping the model.path ("dsr1-fp8")
and precision ("fp8") entries unchanged.
```yaml
context-length: 10000
disaggregation-mode: "prefill"
mem-fraction-static: 0.95
max-total-tokens: 8192
chunked-prefill-size: 8192
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Find and examine the YAML file
find . -name "low_latency.yaml" | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 107

🏁 Script executed:

```shell
# Read the YAML file with line numbers to verify the values at lines 72-76 and 122-125
wc -l recipes/gb200-fp8/1k8k/low_latency.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 109

🏁 Script executed:

```shell
# Get the relevant sections with context
sed -n '70,80p' recipes/gb200-fp8/1k8k/low_latency.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 418

🏁 Script executed:

```shell
# Get lines 120-125 to see isl and osl values
sed -n '120,125p' recipes/gb200-fp8/1k8k/low_latency.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 160
🌐 Web query:
SGLang max-total-tokens parameter meaning
💡 Result:
In SGLang, --max-total-tokens sets the maximum number of tokens that can be stored in the server’s KV-cache “memory pool” across all in-flight requests (i.e., the total cached tokens over all sequences/requests at once). This effectively caps how much concurrent context the server can keep resident; if the pool is full, new work may be queued/rejected or scheduling becomes constrained. If you don’t set it, SGLang auto-computes it based on the static GPU memory allocation (memory usage fraction) and related heuristics; it’s mainly useful for development/debugging and manual capacity control. [1], [2]
Benchmark total tokens exceed max-total-tokens capacity.
isl (1024) + osl (8192) = 9216 tokens required per benchmark request, but max-total-tokens is set to 8192. This will constrain or fail requests attempting to cache this many tokens. Raise max-total-tokens to at least 9216, or up to 10000 to match the context-length limit.
Suggested fix

```diff
-  max-total-tokens: 8192
+  max-total-tokens: 10000
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```diff
 context-length: 10000
 disaggregation-mode: "prefill"
 mem-fraction-static: 0.95
-max-total-tokens: 8192
+max-total-tokens: 10000
 chunked-prefill-size: 8192
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low_latency.yaml` around lines 72 - 76, The benchmark
requests require 1024 (isl) + 8192 (osl) = 9216 tokens but max-total-tokens is
set to 8192; update the max-total-tokens field to at least 9216 (or up to 10000
to match context-length) so caching can accommodate the full request—modify the
max-total-tokens value in the same YAML block containing context-length,
disaggregation-mode, mem-fraction-static, and chunked-prefill-size.
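The budget arithmetic behind this comment is simple enough to check in isolation (values copied from the recipe under review):

```python
# Values from recipes/gb200-fp8/1k8k/low_latency.yaml as discussed above.
isl, osl = 1024, 8192          # benchmark input/output sequence lengths
context_length = 10000
max_total_tokens = 8192        # current (too-small) KV-cache budget

required = isl + osl           # tokens a single request keeps resident
print(required)                          # → 9216
print(required <= max_total_tokens)      # → False: current budget is too small
print(required <= context_length)        # → True: raising to 10000 would fit
```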
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency.yaml`:
- Around line 13-22: The container tag in the recipe (container:
"lmsysorg/sglang:v0.5.8-cu130-runtime") is unavailable; update the container
value to a published tag such as "lmsysorg/sglang:v0.5.7-cu130-runtime" or the
moving tag "latest-cu130-runtime", or if GB200/Blackwell-specific support is
required switch to a Blackwell-compatible tag (for example a "b200-cu129"
variant); keep the same keys (path: "dsr1-fp8", precision: "fp8") and only
change the container string to one of the suggested valid tags.
🧹 Nitpick comments (2)
recipes/gb200-fp8/8k1k/max_tpt.yaml (2)
6-10: Pin the nginx image tag to avoid drift.

Line 10 uses `nginx` without a tag, which implicitly tracks `latest` and can change across deployments.
27-61: Consider deduplicating shared env vars (YAML anchors) if supported.

Lines 29-61 duplicate many keys across prefill/decode; anchors reduce drift.

♻️ Example refactor (anchors)

```diff
 backend:
+  # Shared environment variables (if YAML anchors are supported by the loader)
+  common_environment: &common_env
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
   # Prefill-specific environment variables
   prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    MC_TE_METRIC: "true"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    PYTHONUNBUFFERED: "1"
+    <<: *common_env
   # Decode-specific environment variables
   decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512"
-    MC_TE_METRIC: "true"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    PYTHONUNBUFFERED: "1"
+    <<: *common_env
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
```
```diff
 path: "dsr1-fp8"
 container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
 precision: "fp8"

 resources:
   gpu_type: "gb200"
   prefill_nodes: 1
-  decode_nodes: 4
+  decode_nodes: 1
   prefill_workers: 1
-  decode_workers: 4
+  decode_workers: 1
```
🧩 Analysis chain
🌐 Web query:
lmsysorg/sglang v0.5.8 cu130 runtime Docker Hub
💡 Result:
I couldn’t find any Docker Hub tag for lmsysorg/sglang:v0.5.8-cu130-runtime via public indexing/search (it does not appear to be published/visible as a tag).
What is visible:
- CUDA 13.0 "stable" runtime tag: `lmsysorg/sglang:latest-cu130-runtime` (recommended in SGLang install docs for CUDA 13 environments). [1]
- Most recent versioned CUDA 13.0 runtime tag I can confirm exists: `lmsysorg/sglang:v0.5.7-cu130-runtime`. [2]

Pull commands:

```shell
docker pull lmsysorg/sglang:latest-cu130-runtime
# or
docker pull lmsysorg/sglang:v0.5.7-cu130-runtime
```

Sources: [1] [2]
🌐 Web query:
SGLang v0.5.8 GB200 compatibility
💡 Result:
SGLang v0.5.8 is intended to run on NVIDIA GB200/B200 (Blackwell), but GB200 compatibility mainly depends on installing a PyTorch build that supports sm_100 (Blackwell). The SGLang docs explicitly call out GB200 as a supported platform and note that pip/uv dependency resolution can accidentally install an incompatible Torch build on GB200 unless you correct it. [1], [2]
What to do for SGLang v0.5.8 on GB200
- Install SGLang 0.5.8, then force-reinstall a Blackwell-capable CUDA PyTorch wheel. The SGLang install guide's GB200 "quick fix" is: [1]

```shell
uv pip install "torch==2.9.1" "torchvision" --extra-index-url https://download.pytorch.org/whl/cu129 --force-reinstall
```

- If you prefer containers (often easiest for a new GPU arch), SGLang has published Blackwell/B200 Docker tags (e.g., `b200-cu129`, and cu128 variants) in their install guidance/issues. [3]
Why this matters
Older setups commonly failed on B200/GB200 because Torch builds without sm_100 support would trigger “not compatible” warnings and “no kernel is available” runtime errors. [4]
Sources: SGLang install docs (GB200 Torch fix) [1], SGLang docs (GB200 listed in hardware support) [2], SGLang repo issue with B200/Blackwell Docker tags [3], GitHub discussion showing the typical sm_100 Torch incompatibility failure mode [4].
🌐 Web query:
lmsysorg/sglang container tags available versions
💡 Result:
lmsysorg/sglang images are published on Docker Hub, and the tag set is fairly large (CUDA/ROCm variants, runtime vs full, nightly, etc.). Here are the tag families you’ll see plus how to list everything programmatically.
Common/important tags (examples)
- Stable "moving" tags (recommended entry points): `latest`, `latest-runtime` (smaller "runtime" variant) (docs.sglang.io)
- CUDA 13 line examples called out in docs: `dev-cu13`, `latest-cu130-runtime` (docs.sglang.io)
- Nightly tags (date + git sha style), e.g. `nightly-dev-20260202-9227d4f7` (hub.docker.com) and `nightly-dev-cu13-20260116-146b5fcc` (hub.docker.com)
- Release/versioned tags (semver-like), including ROCm variants, e.g. `v0.5.6.post1-rocm700-mi35x` (hub.docker.com), `v0.5.5.post2-rocm630-mi30x` (hub.docker.com)
- Other hardware/stack variants also exist (the Docker Hub "recent tags" list changes over time), e.g. `latest-cu130` has recently appeared (hub.docker.com)

List all available tags (recommended)

Docker Hub's UI tags page is JS-rendered, so the most reliable way is the Docker Hub API approach (paginated):

```shell
curl -s "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags?page_size=100" \
  | jq -r '.results[].name'
```

Then follow `.next` to paginate until it's null (same general pagination pattern Docker documents for Hub APIs). (stackoverflow.com)
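The follow-`.next` loop described above can be sketched as follows; the `fetch` callable and stubbed pages are illustrative stand-ins for real HTTP calls to the Hub API:

```python
def collect_tags(fetch, first_url):
    """Accumulate tag names by following `.next` links until they are null."""
    tags, url = [], first_url
    while url is not None:
        page = fetch(url)  # in practice: requests.get(url).json()
        tags.extend(result["name"] for result in page["results"])
        url = page["next"]
    return tags

# Two stubbed pages mimicking the Docker Hub tags-endpoint response shape.
PAGES = {
    "page1": {"results": [{"name": "latest-cu130-runtime"}], "next": "page2"},
    "page2": {"results": [{"name": "v0.5.7-cu130-runtime"}], "next": None},
}

print(collect_tags(PAGES.__getitem__, "page1"))
# → ['latest-cu130-runtime', 'v0.5.7-cu130-runtime']
```

Swapping the stub for a real `fetch` that GETs each URL yields the full tag list, from which the presence (or absence) of a tag like `v0.5.8-cu130-runtime` can be checked directly.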
If you tell me whether you care about CUDA vs ROCm vs Ascend/CANN (and which CUDA version), I can narrow it to the exact tag(s) you should use.
Citations:
- 1: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 2: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 3: https://hub.docker.com/r/lmsysorg/sglang/?utm_source=openai
- 4: https://hub.docker.com/layers/lmsysorg/sglang/nightly-dev-cu13-20260116-146b5fcc/images/sha256-524a6b0432a5f1c51ecab656ed3c6a038cc02f7fc2ddfe5ca1e0b9d44dc98a28?utm_source=openai
- 5: https://hub.docker.com/layers/lmsysorg/sglang/v0.5.6.post1-rocm700-mi35x/images/sha256-a1c98407dd3545c9585ad6b594ded5594a0c2b0aa108edfde41cd6dcdce69724?utm_source=openai
- 6: https://hub.docker.com/layers/lmsysorg/sglang/v0.5.5.post2-rocm630-mi30x/images/sha256-659910c0c70279f25a0320f0ca8fc7411932148def3602e93ca31658da52dc44?utm_source=openai
- 7: https://hub.docker.com/r/lmsysorg/sglang
- 8: https://stackoverflow.com/questions/62795446/how-to-get-all-tags-from-docker-hub-private-repositories-as-shell-script?utm_source=openai
Use an available container tag; lmsysorg/sglang:v0.5.8-cu130-runtime does not exist on Docker Hub.
The specified tag is not published. The most recent CUDA 13.0 runtime tag available is lmsysorg/sglang:v0.5.7-cu130-runtime, or use the stable moving tag latest-cu130-runtime. If GB200 compatibility is critical, consider using Blackwell-specific tags (e.g., b200-cu129) as SGLang v0.5.8 requires PyTorch with sm_100 support for GB200 GPUs.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/low-latency.yaml` around lines 13 - 22, The container
tag in the recipe (container: "lmsysorg/sglang:v0.5.8-cu130-runtime") is
unavailable; update the container value to a published tag such as
"lmsysorg/sglang:v0.5.7-cu130-runtime" or the moving tag "latest-cu130-runtime",
or if GB200/Blackwell-specific support is required switch to a
Blackwell-compatible tag (for example a "b200-cu129" variant); keep the same
keys (path: "dsr1-fp8", precision: "fp8") and only change the container string
to one of the suggested valid tags.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k8k/low_latency.yaml`:
- Line 64: The config currently sets trust-remote-code: true while referencing
the unpinned model deepseek-ai/DeepSeek-R1; pin the model to a specific
immutable revision or confirm the repo is immutable: update the
served-model-name for DeepSeek-R1 to include a revision hash (e.g.,
served-model-name: "deepseek-ai/DeepSeek-R1@<commit-sha>") wherever DeepSeek-R1
is referenced and keep trust-remote-code: true only when you have pinned or
vetted the upstream repository (check occurrences around lines where DeepSeek-R1
is used to update both instances).
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml`:
- Around line 3-15: Update the pinned image tags for reproducibility: replace
the nginx_container value (nginx_container) with a specific nginx tag (e.g.,
nginx:1.27) instead of the implicit latest, and change the model.container value
(container: "lmsysorg/sglang:v0.5.8-cu130") to a valid SGLang tag such as
"lmsysorg/sglang:v0.5.7-cu130-runtime" or
"lmsysorg/sglang:latest-cu130-runtime"; keep dynamo.version (version: 0.8.1)
as-is but verify compatibility after changing the SGLang and nginx tags.
In `@recipes/gb200-fp8/8k1k/mid-curve.yaml`:
- Around line 1-4: The YAML's top-level name field ("gb200-8k1k-fp8-mid-tpt")
does not match the file name (mid-curve.yaml); update one to be
consistent—either rename the file to match the name value or change the name
field to reflect the file name (e.g., "gb200-8k1k-fp8-mid-curve"), ensuring the
top-level name value and the file name are identical for clarity; modify the
name field in this file (the 'name' key) or rename the file accordingly.
🧹 Nitpick comments (5)
recipes/gb200-fp8/1k1k/max-tpt.yaml (2)
115-116: Consider quoting string value for consistency.

The `disaggregation-transfer-backend: nixl` addition is appropriate for high-throughput disaggregated inference. However, for consistency with other string values in the config (e.g., `attention-backend: "trtllm_mla"`, `eplb-algorithm: "deepseek"`), consider quoting the value.

🔧 Suggested change

```diff
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```
167-168: Same quoting consistency note applies here.

For symmetry with the prefill section and other string values, consider quoting this value as well.

🔧 Suggested change

```diff
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

recipes/gb200-fp8/8k1k/low-latency.yaml (1)
61-86: Consider adding `enable-symm-mem: true` to prefill config.

The 1k1k variant includes `enable-symm-mem: true` in its prefill sglang_config (line 83), but this 8k1k variant omits it. The decode section does have it (line 104). If symmetric memory is beneficial for GB200 disaggregated inference, it should likely be enabled in prefill as well.

Suggested addition

```diff
   expert-parallel-size: 1
   fp8-gemm-backend: "flashinfer_trtllm"
+  enable-symm-mem: true
   disaggregation-bootstrap-port: 30001
   disaggregation-transfer-backend: nixl
```

recipes/gb200-fp8/1k1k/mid-curve.yaml (2)
3-10: Pin the nginx image and verify multi-frontend support in Dynamo 0.8.1.

Using `nginx` without a tag pulls a mutable "latest," which can break reproducibility. Also confirm that Dynamo 0.8.1 supports `enable_multiple_frontends`, `num_additional_frontends`, and `nginx_container` to avoid silent config drift.

🔒 Example pin (replace with your chosen version or digest)

```diff
-  nginx_container: nginx
+  nginx_container: nginx:<version-or-digest>
```
107-115: Verify nixl backend availability in the image (prefill & decode).

Setting `disaggregation-transfer-backend: nixl` will hard-fail if the container lacks nixl libraries or if the SGLang build doesn't support it. Please confirm availability in the selected image and version, and consider adding a validation guard or documented fallback.

Also applies to: 160-165
```yaml
sglang_config:
  prefill:
    served-model-name: "deepseek-ai/DeepSeek-R1"
    trust-remote-code: true
```
🧩 Analysis chain
🏁 Script executed:

```shell
cd /tmp && find . -name "low_latency.yaml" -type f 2>/dev/null | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:

```shell
fd "low_latency.yaml" -type f
```

Repository: ishandhanani/srt-slurm
Length of output: 236
🌐 Web query:
Hugging Face transformers trust_remote_code security implications risk
💡 Result:
Setting trust_remote_code=True in Hugging Face Transformers tells from_pretrained() to download and execute Python code from the model repository (e.g., custom modeling_*.py, tokenization_*.py, etc.) so the model can define its own classes/logic. That means it’s effectively a supply-chain remote code execution (RCE) opt-in: if the repo is malicious (or later becomes malicious), that code can run with your process’ privileges (read files/keys, exfiltrate data, spawn processes, etc.). [1]
Key security implications / risks
- Arbitrary code execution by design: enabling the flag allows execution of repo-provided Python—so the repo author (or anyone who compromises the repo) can run code on load/import. [1]
- Mutable upstream risk: if you don’t pin a specific commit/tag (“revision”), the repo can change later and you may execute new code on a future load. Hugging Face explicitly recommends pinning a revision when using remote code. [1]
- Compounding risk with unsafe weight formats: even without remote code, pickled model artifacts (common in PyTorch `.bin`) can enable code execution when deserialized; HF warns about "dangerous arbitrary code execution attacks" from pickle and encourages safer approaches (notably `safetensors`). [2][3]
- Real-world abuse exists: security researchers have reported malicious/weaponized models on the Hub leveraging these classes of issues (notably pickle-based payloads). [4]
Practical mitigations / best practices
- Avoid
trust_remote_codeunless you truly need it; prefer architectures supported natively by Transformers. [1] - If you must use it:
- Inspect the repo’s Python files you’re about to execute (modeling/tokenizer code) before enabling the flag. [1]
- Pin an immutable
revision(commit SHA or trusted tag) to prevent surprise updates. [1] - Run loading/inference in a sandbox (container/VM), with least privilege (no secrets in env, read-only FS where possible, restricted network/egress).
- Reduce artifact/deserialization risk:
- Prefer
safetensorsand/or enforce it withuse_safetensors=True(and fail if not available). [1] - Treat pickle-based weights as untrusted executables unless the publisher is trusted/verified. [2][3]
- Prefer
Bottom line: trust_remote_code=True is equivalent to “I agree to run this repository’s Python on my machine,” so only use it for trusted, reviewed, pinned repositories, ideally in an isolated runtime. [1]
Sources:
[1] Hugging Face Transformers SECURITY policy (remote code + revision pinning guidance)
[2] Hugging Face Hub docs: Pickle scanning + warning about arbitrary code execution
[3] Hugging Face Transformers SECURITY policy (safetensors preference / use_safetensors)
[4] Ars Technica report on malicious models and pickle-based attacks on Hugging Face Hub
🌐 Web query:
deepseek-ai DeepSeek-R1 trust_remote_code requirement
💡 Result:
Yes: loading deepseek-ai/DeepSeek-R1 via Hugging Face Transformers typically requires `trust_remote_code=True`, because the repo includes custom model code that Transformers must execute (otherwise you'll get an error telling you to pass `trust_remote_code=True`). [1][2][3]

Example (Transformers):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```

Note: DeepSeek's model card also says DeepSeek-R1 is "not directly supported" by Transformers (at least as stated there), and points users to other guidance/tools; the Distill variants are intended to be used like Qwen/Llama-family models. [1]
🏁 Script executed:

```shell
git ls-files | grep -i "low_latency.yaml"
```

Repository: ishandhanani/srt-slurm
Length of output: 105

🏁 Script executed:

```shell
find . -name "*.yaml" | grep -i "gb200\|low_latency" | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 873

🏁 Script executed:

```shell
cat -n recipes/gb200-fp8/1k8k/low_latency.yaml | sed -n '55,75p'
```

Repository: ishandhanani/srt-slurm
Length of output: 836

🏁 Script executed:

```shell
cat -n recipes/gb200-fp8/1k8k/low_latency.yaml | sed -n '85,100p'
```

Repository: ishandhanani/srt-slurm
Length of output: 679

🏁 Script executed:

```shell
cat recipes/gb200-fp8/1k8k/low_latency.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 3743
Pin the DeepSeek-R1 model revision or confirm repo is immutable.
trust-remote-code: true enables arbitrary code execution from the model repository. DeepSeek-R1 requires this flag, but the current configuration (lines 64 and 93) uses an unpinned model reference (deepseek-ai/DeepSeek-R1). Without pinning a specific commit or revision, future model loads may execute different code. Either add a specific revision (e.g., served-model-name: "deepseek-ai/DeepSeek-R1@<commit-sha>") or confirm the upstream repository is vetted and immutable.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low_latency.yaml` at line 64, The config currently
sets trust-remote-code: true while referencing the unpinned model
deepseek-ai/DeepSeek-R1; pin the model to a specific immutable revision or
confirm the repo is immutable: update the served-model-name for DeepSeek-R1 to
include a revision hash (e.g., served-model-name:
"deepseek-ai/DeepSeek-R1@<commit-sha>") wherever DeepSeek-R1 is referenced and
keep trust-remote-code: true only when you have pinned or vetted the upstream
repository (check occurrences around lines where DeepSeek-R1 is used to update
both instances).
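The `repo@sha` convention suggested above is not something Transformers parses on its own; a loader would need to split the reference into a repo id and a `revision` argument. A minimal sketch (the `@` separator and helper name are illustrative assumptions, not part of any existing config schema):

```python
def split_model_ref(ref: str):
    """Split an 'org/model@revision' reference into (repo_id, revision).

    Returns revision=None when the reference is unpinned, which a caller
    could treat as a lint failure whenever trust_remote_code is enabled.
    """
    repo_id, _, revision = ref.partition("@")
    return repo_id, (revision or None)
```

With a revision in hand, `from_pretrained(repo_id, revision=revision, trust_remote_code=True)` pins the executed model code to that commit instead of whatever the repo's default branch currently holds.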
```yaml
dynamo:
  version: 0.8.1

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9
  nginx_container: nginx

model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.8-cu130"
  precision: "fp8"
```
🧩 Analysis chain
🌐 Web query:
nginx docker image default tag latest behavior
💡 Result:
- If you don’t specify a tag, Docker uses `:latest` by default. So `docker pull nginx` and `docker run nginx` are equivalent to `nginx:latest` (you’ll see “Using default tag: latest”). [1]
- `latest` is not special in Docker; it’s just a normal tag that the image publisher chooses to maintain. It does not inherently mean “newest,” and it can change over time. [2]
- For the Docker Official Image `nginx`, Docker Hub documents that `latest` points to the “mainline” variant (i.e., the mainline build they publish as `latest`). [3]

Practical implication: if you want repeatable builds, don’t rely on `nginx`/`nginx:latest`; pin to a specific tag (or even a digest). [1]
Sources
[1] Docker Docs — docker image pull (default tag behavior) (docs.docker.com)
[2] How-To Geek — Explanation of Docker’s latest tag semantics (howtogeek.com)
[3] Docker Hub — nginx Official Image overview (tag mapping for latest) (hub-stage.docker.com)
Citations:
- 1: https://docs.docker.com/reference/cli/docker/image/pull/
- 2: https://www.howtogeek.com/devops/understanding-dockers-latest-tag/?utm_source=openai
- 3: https://hub-stage.docker.com/_/nginx
🌐 Web query:
lmsysorg sglang docker image v0.5.8-cu130 tag
💡 Result:
I can’t find any published Docker Hub tag for `lmsysorg/sglang:v0.5.8-cu130` (or `v0.5.8-cu130-runtime`) in the public Docker Hub listing/mirrors I checked. The newest CUDA 13.0 versioned tag I can confirm is:

- `lmsysorg/sglang:v0.5.7-cu130-runtime` (pull: `docker pull lmsysorg/sglang:v0.5.7-cu130-runtime`) [1]

If you just need the CUDA 13.0 stable line (not pinned to v0.5.8), SGLang’s install docs recommend:

- `lmsysorg/sglang:latest-cu130-runtime` [2]
Sources: [1] https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime, [2] https://docs.sglang.io/get_started/install.html
🌐 Web query:
NVIDIA Dynamo 0.8.1 release
💡 Result:
NVIDIA Dynamo v0.8.1 was released on January 23, 2026 (GitHub “Latest” release). It’s a patch release focused on Kubernetes profiler improvements and bug fixes, including: model-cache PVC mounting for profiler pods, fixes to SGLang YAML boolean flag parsing, CUDA 13 SGLang container build fixes (CuDNN install logic), and a Kubernetes worker/instance ID hashing/precision issue fix. [2]
Release artifacts for v0.8.1 include updated NGC runtime containers (vLLM, SGLang, TensorRT-LLM), a dynamo-frontend:0.8.1 image, Helm chart versions 0.8.1, and PyPI wheels (ai-dynamo==0.8.1, ai-dynamo-runtime==0.8.1, kvbm==0.8.1). [1]
Also on January 23, 2026, NVIDIA published v0.8.1.post1 (no GitHub release): a patch specifically for PyPI wheels and the TensorRT-LLM runtime container, updating TRT-LLM to v1.2.0rc6.post2 while other artifacts remain at v0.8.1. [1]
Sources: NVIDIA Dynamo release artifacts/docs [1], GitHub v0.8.1 release notes [2].
Pin and correct container image tags for reproducibility and functionality.
- nginx: `nginx_container: nginx` defaults to `latest` and is not reproducible. Pin to a specific version (e.g., `nginx:1.27` or similar).
- sglang: The tag `lmsysorg/sglang:v0.5.8-cu130` does not exist on Docker Hub. The latest CUDA 13.0 versioned tag is `v0.5.7-cu130-runtime`. Either update to that tag or use `lmsysorg/sglang:latest-cu130-runtime`.
- Dynamo 0.8.1: Valid and pinned; verify compatibility with the SGLang and Nginx versions used.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml` around lines 3 - 15, Update the pinned
image tags for reproducibility: replace the nginx_container value
(nginx_container) with a specific nginx tag (e.g., nginx:1.27) instead of the
implicit latest, and change the model.container value (container:
"lmsysorg/sglang:v0.5.8-cu130") to a valid SGLang tag such as
"lmsysorg/sglang:v0.5.7-cu130-runtime" or
"lmsysorg/sglang:latest-cu130-runtime"; keep dynamo.version (version: 0.8.1)
as-is but verify compatibility after changing the SGLang and nginx tags.
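Applying the prompt's suggestions, the pinned fragment could look like the sketch below. The `nginx:1.27` and `v0.5.7-cu130-runtime` tags are the candidates named above, not verified choices; confirm they exist and are compatible before committing:

```yaml
dynamo:
  version: 0.8.1                  # valid PyPI/NGC release, already pinned

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9
  nginx_container: "nginx:1.27"   # pinned instead of the implicit :latest

model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.7-cu130-runtime"  # confirmed tag; v0.5.8-cu130 is unpublished
  precision: "fp8"
```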
```yaml
name: "gb200-8k1k-fp8-mid-tpt"

dynamo:
  version: 0.8.1
```
File name and name field mismatch.
The file is named mid-curve.yaml but the name field is "gb200-8k1k-fp8-mid-tpt". Consider aligning these to avoid confusion during debugging and maintenance.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/8k1k/mid-curve.yaml` around lines 1 - 4, The YAML's
top-level name field ("gb200-8k1k-fp8-mid-tpt") does not match the file name
(mid-curve.yaml); update one to be consistent—either rename the file to match
the name value or change the name field to reflect the file name (e.g.,
"gb200-8k1k-fp8-mid-curve"), ensuring the top-level name value and the file name
are identical for clarity; modify the name field in this file (the 'name' key)
or rename the file accordingly.
* Update GB200-FP8 configs
* Update GB200-FP4 configs
* Add nginx container to all GB200-FP8 configs
* Add nginx container to GB200-FP4 configs
* Cleanup configs
* Switch to use fast DG cache compile
* Merge pull request #118 from ishandhanani/grho/Jan29_a: configs for gb300-fp8-no-mtp
* Update SGL-GB200-FP4-1k8k configs to use dynamo-sglang container and specify nginx container
* Add GB200-FP8-1k8k
* Update GB200 FP8 1k8k recipes
* typo
* only build for 9.0
* go
* go
* again
* try again
* go
* Update gb200 recipes (#130)
  * Update GB200-FP8 configs
  * Update GB200-FP4 configs
  * Add nginx container to all GB200-FP8 configs
  * Add nginx container to GB200-FP4 configs
  * Cleanup configs
  * Switch to use fast DG cache compile
* fix container
* clean up old
* Add 1k1k STP and MTP disagg H100 configs (#140)
  * Add 1k1k STP and MTP disagg H100 configs
  * Update H100 FP8 configs with verified 29 Pareto-optimal points. Replace previous configs with verified Pareto-optimal configurations:
    - 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
    - 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
    - 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
    - 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)
    Standardize container to nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1
  * Update H100 configs to tensorrtllm-runtime:0.8.1.post3. Update all 29 H100 FP8 config files to use the new container:
    - nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
* updates the recipe for Dynamo-SGLang B200 submissions
* adds modified B200-fp8 recipes
* updates the recipes
* prune the concurrency
* Add B200 MTP FP4 SGLANG recipes
* Update model path cand container
  Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
* modify b200 sgl fp4 non-mtp configs (#168)
* adds conc=128 point
* adds 1p2d config
* modify job name to support multiple gh runners (#182)
* Add resolved B200 FP8 8k1k recipe variants for CI compatibility
  14 standalone recipe files resolved from the consolidated 8k1k.yaml (main branch) for use with the sa-submission-q1-2026 srtctl which does not support zip_override syntax.
  STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
  MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
  Made-with: Cursor
* Bump MTP 8k1k health check timeout from 60min to 120min
  EAGLE speculative decoding + cuda-graph-max-bs=1024 requires ~50min of CUDA graph capture alone on the decode worker. Combined with model loading, DeepGEMM JIT warmup, and FlashInfer autotune, the total init exceeds the 60min (360 attempts x 10s) health check window on cold nodes. Increase max_attempts from 360 to 720 (7200s = 120min) on all 7 MTP recipe variants to provide sufficient headroom.
  Made-with: Cursor
* Fix cuda-graph-max-bs on MTP maxtpt decode workers
  With data-parallel-size=8 and dp-attention, the scheduler distributes requests across 8 DP replicas. Each replica only sees max-running-requests/dp concurrent sequences, so cuda-graph-max-bs should be divided by dp accordingly. Previous values caused CUDA graph capture of 99 batch sizes per DP replica with EAGLE speculative decoding, taking 80+ minutes and exceeding the health check timeout. Corrected values capture only 35 batch sizes, finishing in ~1 minute with no performance regression. Validated: MTP 3P1D output throughput 15,124 tok/s matches reference 14,995 tok/s (+0.9%).
  - maxtpt_0: 128 -> 16 (max-running=128, dp=8)
  - maxtpt_1: 256 -> 32 (max-running=256, dp=8)
  - maxtpt_2: 512 -> 64 (max-running=512, dp=8)
  - maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)
  Made-with: Cursor
* fix rebase

---------

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>