📝 Walkthrough

Standardizes and renames multiple GB200 FP4/FP8 recipe YAMLs: bumps dynamo to 0.8.1, enables multi-frontends (adds `nginx_container` and `num_additional_frontends`), updates `model.container` to `lmsysorg/sglang:v0.5.8-cu130(-runtime)`, removes legacy SGLANG envs, and adds `fp*-gemm-backend` and `disaggregation-transfer-backend` settings; benchmark concurrencies are adjusted.

Sequence Diagram(s): omitted — changes are configuration updates not introducing new multi-component control flow warranting a sequence diagram.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 inconclusive)
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp4/1k1k/low-latency.yaml`:
- Around line 81-82: Replace the invalid config key fp4-gemm-backend with the
correct moe-runner-backend: keep disaggregation-transfer-backend: nixl as-is,
remove or rename the fp4-gemm-backend entry and add moe-runner-backend:
flashinfer_trtllm so the backend is configured using the supported key (update
the entry that currently reads fp4-gemm-backend to moe-runner-backend and retain
the flashinfer_trtllm value).
In `@recipes/gb200-fp4/1k8k/low-latency.yaml`:
- Around line 1-4: The dynamo version specified under the "dynamo" block
(dynamo.version: 0.8.1) is invalid on PyPI; update the version to a real release
(e.g., change dynamo.version to 0.8.0) in this recipe (name:
"gb200-fp4-1k8k-low-latency") and apply the same correction to other GB200
recipes that reuse the erroneous 0.8.1 value so the configuration matches an
actual PyPI release.
In `@recipes/gb200-fp8/1k1k/low-latency.yaml`:
- Line 46: The SGLANG_DG_CACHE_DIR value in the decode_environment entry
currently has a typo ("/configsdg-0.5.8_cu13"); update the environment variable
SGLANG_DG_CACHE_DIR to use the correct path "/configs/dg-0.5.8_cu13" so it
matches other recipe files and consistent cache directory naming.
- Line 29: The SGLANG_DG_CACHE_DIR value contains a typo: change the path in the
SGLANG_DG_CACHE_DIR entry from "/configsdg-0.5.8_cu13" to
"/configs/dg-0.5.8_cu13" so it matches the other recipes (e.g., mid-curve.yaml,
max-tpt.yaml) and restores the missing slash between "configs" and "dg".
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml`:
- Line 12: The inline comment for the YAML key num_additional_frontends is
truncated; update the comment associated with num_additional_frontends to a
complete sentence such as "# Additional routers (total = 1 +
num_additional_frontends)" or "# Additional routers (total routers = 1 +
num_additional_frontends)" so it clearly explains that the total router count
equals the base 1 plus the value of num_additional_frontends.
- Line 1: The configuration 'name' field is incorrect (it reads
"gb200-fp8-1k1k-max-tpt") and must match the benchmark settings (isl: 1024, osl:
8192); update the name value to "gb200-fp8-1k8k-max-tpt" in this YAML (modify
the name key at the top of the file) so the config name reflects the 1k8k setup.
- Around line 70-174: The prefill and decode blocks in sglang_config are missing
the fp8-gemm-backend setting; add fp8-gemm-backend: "flashinfer_trtllm" inside
both the prefill and decode sections (near other FP8/attention settings such as
kv-cache-dtype and attention-backend) so that the fp8 GEMM backend is configured
for functions that rely on fp8 GEMM behavior.
In `@recipes/gb200-fp8/8k1k/max_tpt.yaml`:
- Around line 114-115: The YAML contains duplicate keys named
"disaggregation-bootstrap-port" within the prefill and decode mappings causing
potential collisions; remove the redundant "disaggregation-bootstrap-port"
entries so each mapping has a single port key, keeping the intended value only
once (check the prefill and decode mappings where
"disaggregation-bootstrap-port" appears twice and delete the duplicate
occurrences), and verify there are no other duplicate keys like
"disaggregation-transfer-backend" in those mappings.
In `@recipes/gb200-fp8/8k1k/mid-curve.yaml`:
- Around line 113-114: There are duplicate disaggregation-bootstrap-port entries
in the YAML (appearing in both the prefill and decode sections); remove the
duplicate key so each section contains a single unique
disaggregation-bootstrap-port entry and ensure the corresponding
disaggregation-transfer-backend (e.g., nixl) remains paired with the intended
port; update the prefill/decode blocks (where keys repeat) to keep only one
consistent declaration of disaggregation-bootstrap-port and remove the redundant
one.
🧹 Nitpick comments (8)
recipes/gb200-fp4/1k8k/max-tpt.yaml (1)
26-62: Consider extracting common environment variables to reduce duplication.

The `prefill_environment` and `decode_environment` sections share many identical environment variables (lines 27-41 and 45-59). Consider extracting common variables to a shared section if the configuration format supports YAML anchors or templating, to reduce maintenance burden.

recipes/gb200-fp4/8k1k/mid-curve.yaml (1)
106-107: Minor: Consider quoting `nixl` for consistency.

The `disaggregation-transfer-backend: nixl` value is unquoted while other string values in this file are quoted (e.g., `"flashinfer_cutlass"`). While valid YAML, quoting would improve consistency.

Suggested fix:

```diff
   fp4-gemm-backend: "flashinfer_cutlass"
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

recipes/gb200-fp4/8k1k/low-latency.yaml (1)
82-83: Add a comment explaining why the low-latency profile uses `flashinfer_trtllm` instead of `flashinfer_cutlass`.

All low-latency recipes across all profiles (8k1k, 1k8k, 1k1k) use `flashinfer_trtllm`, while mid-curve and max-tpt consistently use `flashinfer_cutlass`. This intentional difference would benefit from a clarifying comment in the YAML to explain the architectural rationale.

recipes/gb200-fp4/1k8k/low-latency.yaml (2)
6-9: Minor: Trailing whitespace on lines 6-7.

Lines 6 and 7 appear to have trailing whitespace after `frontend:` and `type: dynamo`. While this won't affect YAML parsing, it's a minor style inconsistency. The reduction from 4 to 3 additional frontends should be validated against expected load for the 1k8k configuration.

🧹 Suggested cleanup:

```diff
-frontend: 
-  type: dynamo 
+frontend:
+  type: dynamo
```
82-83: LGTM on new backend configurations.

The addition of `fp4-gemm-backend: "flashinfer_trtllm"` and `disaggregation-transfer-backend: nixl` to both prefill and decode sections is consistent and aligns with the removal of `SGLANG_ENABLE_FLASHINFER_GEMM` from the environment variables. Explicit configuration is cleaner than environment variable flags.

One minor inconsistency: `nixl` is unquoted while other string values like `"flashinfer_trtllm"` are quoted. Consider quoting for consistency:

```diff
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

Also applies to: 106-107
recipes/gb200-fp8/1k8k/max_tpt.yaml (1)
16-18: Consider removing commented-out container lines.

Multiple commented-out container versions are left in the file. If these are no longer needed for reference, consider removing them to keep the configuration clean.

Proposed cleanup:

```diff
 model:
   path: "dsfp8"
-# container: "sglang0p5p5ppost2"
-# container: "sglang0p5p7"
-# container: "sglang0p5p8"
   container: "sglang0p5p8_cu13"
   precision: "fp8"
```

recipes/gb200-fp8/1k8k/mid_curve.yaml (1)
26-60: Optional: dedupe shared env vars if YAML merge keys are supported.

Both environments are nearly identical; anchors reduce drift across prefill/decode configs.

♻️ Possible YAML-anchor refactor:

```diff
-  prefill_environment:
+  prefill_environment: &common_environment
     TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
     SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
     DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    MC_TE_METRIC: "true" 
+    MC_TE_METRIC: "true"
     SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
     SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
     SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
     MC_FORCE_MNNVL: "1"
     NCCL_MNNVL_ENABLE: "1"
     NCCL_CUMEM_ENABLE: "1"
     SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
     SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
     PYTHONUNBUFFERED: "1"

   # Decode-specific environment variables
   decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    <<: *common_environment
     SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
-    MC_TE_METRIC: "true"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    PYTHONUNBUFFERED: "1"
```

recipes/gb200-fp8/1k8k/low_latency.yaml (1)
24-58: Consider YAML anchors to keep prefill/decode env in sync.

The two blocks are identical aside from `SGLANG_DECODE_BOOTSTRAP_TIMEOUT`; anchoring reduces drift.

♻️ Suggested refactor:

```diff
 backend:
-  prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-
-  decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
+  prefill_environment: &common_environment
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
+    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+
+  decode_environment:
+    <<: *common_environment
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
```
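Whether the suggested anchor/merge-key refactor is safe depends on the recipe loader using a YAML 1.1 parser with merge-key support. A minimal sketch, assuming the recipes are parsed with PyYAML (whose `safe_load` resolves `<<` merges):

```python
import yaml

# Sketch of the merge-key pattern suggested above.
# Assumption: the recipe loader uses a YAML 1.1 parser with
# merge-key support, as PyYAML's safe_load does.
doc = """
prefill_environment: &common_environment
  TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
  PYTHONUNBUFFERED: "1"
decode_environment:
  <<: *common_environment
  SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
"""

cfg = yaml.safe_load(doc)
decode = cfg["decode_environment"]
# The decode block inherits the shared vars and keeps its own key.
assert decode["TORCH_DISTRIBUTED_DEFAULT_TIMEOUT"] == "1800"
assert decode["SGLANG_DECODE_BOOTSTRAP_TIMEOUT"] == "1000"
```

If the loader is built on a strict YAML 1.2 parser (e.g., ruamel's pure 1.2 mode), merge keys are not guaranteed, so this should be verified against the actual srt-slurm loader before adopting the refactor.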
```yaml
        disaggregation-transfer-backend: nixl
        fp4-gemm-backend: "flashinfer_trtllm"
```
🧩 Analysis chain
🌐 Web query:
sglang v0.5.8 flashinfer_trtllm backend configuration
💡 Result:
In SGLang v0.5.8, flashinfer_trtllm is configured as the MoE runner backend via --moe-runner-backend (it’s not the attention backend flag). The valid options list includes flashinfer_trtllm. [1]
CLI example (as shown in the SGLang docs)
```shell
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-FP4 \
  --tp 8 \
  --attention-backend trtllm_mla \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --prefill-attention-backend fa4
```

This is an example of "hybrid attention" (different prefill vs decode backends) while using flashinfer_trtllm for the MoE runner. [2]
YAML config equivalent:

```yaml
model-path: nvidia/DeepSeek-R1-FP4
tp: 8
attention-backend: trtllm_mla
moe-runner-backend: flashinfer_trtllm
quantization: modelopt_fp4
prefill-attention-backend: fa4
```

Launch with:

```shell
python3 -m sglang.launch_server --config config.yaml
```

(Config-file support is documented; CLI args override YAML.) [1]
Sources: Server arguments (MoE + config support) [1]; Attention backend doc showing flashinfer_trtllm usage in an example command [2].
🌐 Web query:
sglang v0.5.8 disaggregation-transfer-backend nixl options
💡 Result:
In SGLang v0.5.8, the NIXL “options” for PD disaggregation are mainly (a) selecting NIXL as the transfer backend, plus (b) the general PD-disaggregation knobs (there aren’t extra SGLang CLI flags that are NIXL-specific in the docs).
1) Enable NIXL transfer backend (PD disaggregation)
Use:

```shell
--disaggregation-transfer-backend nixl
```

(supported values: `mooncake`, `nixl`, `ascend`, `fake`) [1]

and run separate servers with `--disaggregation-mode prefill` / `--disaggregation-mode decode`. [1][2]

The docs example (single node) uses exactly these flags with NIXL. [2]
2) PD-disaggregation knobs you can still use with NIXL
These are the PD-disaggregation CLI arguments listed by SGLang (apply regardless of backend selection): [1]
- `--disaggregation-bootstrap-port` (prefill bootstrap server port)
- `--disaggregation-decode-tp` (set on prefill server; decode TP size)
- `--disaggregation-decode-dp` (set on prefill server; decode DP size)
- `--disaggregation-prefill-pp` (set on decode server; prefill PP size)
- `--disaggregation-decode-enable-offload-kvcache`
- `--disaggregation-decode-enable-fake-auto`
- `--num-reserved-decode-tokens`
- `--disaggregation-decode-polling-interval`
- `--disaggregation-ib-device` exists, but its description is tied to auto-detection for the mooncake backend (not described as a NIXL control). [1]
3) Install/bring NIXL (dependency)
SGLang’s PD-disaggregation doc says to install NIXL via pip (or build from source, e.g., if you already have UCX installed). [2]
Sources:
[1] SGLang “Server Arguments” (PD disaggregation section). (docs.sglang.io)
[2] SGLang “PD Disaggregation” (NIXL requirements + usage examples). (docs.sglang.io)
Citations:
- 1: https://docs.sglang.io/advanced_features/server_arguments.html?utm_source=openai
- 2: https://docs.sglang.io/advanced_features/pd_disaggregation.html?utm_source=openai
🌐 Web query:
sglang 0.5.8 release notes backend options
💡 Result:
SGLang v0.5.8 was released Jan 23, 2026 (PyPI upload date). [1]
v0.5.8 release notes (backend-related items)
- SGLang-Diffusion: can run sglang-diffusion with the `diffusers` backend (plus multi-LoRA inference, "SLA attention backends", CLI warmup switch, ComfyUI plugin, and broad perf improvements). [2]
- Flash Attention 4: adds Flash Attention 4 decoding kernels. [2]
“Backend options” you can configure (runtime/kernel backends)
From SGLang server CLI args (see `python -m sglang.launch_server --help`), the main backend-selector knobs are: [3]

- `--model-impl`: `auto` / `sglang` / `transformers` (model implementation choice). [3]
- `--attention-backend`: attention kernel backend. [3]
- `prefill_attention_backend` and `decode_attention_backend` (experimental; override the attention backend separately for prefill vs decode). [3]
- `--sampling-backend`: sampling kernel backend. [3]
- `--grammar-backend`: backend for grammar-guided decoding. [3]
- `--mm-attention-backend`: multimodal attention backend. [3]
Sources: PyPI sglang 0.5.8 release entry [1]; GitHub release notes for v0.5.8 [2]; SGLang “Server Arguments” docs (kernel backend flags) [3].
Use `moe-runner-backend: flashinfer_trtllm` instead of `fp4-gemm-backend: flashinfer_trtllm`.

In SGLang v0.5.8, `flashinfer_trtllm` is a valid backend value but must be configured via the `moe-runner-backend` option, not `fp4-gemm-backend`. The config key `fp4-gemm-backend` does not exist in the official SGLang configuration schema. The `disaggregation-transfer-backend: nixl` setting is valid and correct.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp4/1k1k/low-latency.yaml` around lines 81 - 82, Replace the
invalid config key fp4-gemm-backend with the correct moe-runner-backend: keep
disaggregation-transfer-backend: nixl as-is, remove or rename the
fp4-gemm-backend entry and add moe-runner-backend: flashinfer_trtllm so the
backend is configured using the supported key (update the entry that currently
reads fp4-gemm-backend to moe-runner-backend and retain the flashinfer_trtllm
value).
```diff
 name: "gb200-fp4-1k8k-low-latency"

 dynamo:
-  version: 0.7.0
+  version: 0.8.1
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check if dynamo 0.8.1 exists on PyPI
curl -s "https://pypi.org/pypi/dynamo/0.8.1/json" | head -c 200
```

Repository: ishandhanani/srt-slurm
Length of output: 90

🏁 Script executed:

```shell
# Search for other dynamo version references in the repository
rg "dynamo:" --type yaml -A 1 | head -50
```

Length of output: 3082

🏁 Script executed:

```shell
cat recipes/gb200-fp4/1k8k/low-latency.yaml | head -20
```

Length of output: 405

🏁 Script executed:

```shell
# Search for all dynamo version specifications with "version:" pattern
rg "dynamo:" -A 2 -B 0 | rg "version:|hash:|install:" | sort -u
```

Length of output: 28548
Dynamo version 0.8.1 does not exist on PyPI and must be corrected.
While the experiment name follows the standardized convention, the dynamo configuration is invalid. The version 0.8.1 is not a valid release on PyPI (PyPI returned "Not Found"). Although this version is consistently used across multiple GB200 recipes in this PR, the underlying version must be corrected to an actual release (e.g., 0.8.0 as referenced in docs/config-reference.md) before these configurations can be used.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp4/1k8k/low-latency.yaml` around lines 1 - 4, The dynamo
version specified under the "dynamo" block (dynamo.version: 0.8.1) is invalid on
PyPI; update the version to a real release (e.g., change dynamo.version to
0.8.0) in this recipe (name: "gb200-fp4-1k8k-low-latency") and apply the same
correction to other GB200 recipes that reuse the erroneous 0.8.1 value so the
configuration matches an actual PyPI release.
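A pre-flight check in the recipe tooling could catch this class of error before jobs are rendered, by validating the pinned version against an allow-list of known releases. A sketch; the helper and allow-list are hypothetical (not part of srt-slurm), and the list would need to be maintained by hand or fetched from PyPI:

```python
# Hypothetical recipe pre-flight check: flag dynamo versions that are
# not known PyPI releases. KNOWN_DYNAMO_RELEASES is an assumption, not
# an exhaustive list of real releases.
KNOWN_DYNAMO_RELEASES = {"0.7.0", "0.8.0"}

def check_dynamo_version(recipe: dict) -> list[str]:
    """Return a list of errors for an already-parsed recipe document."""
    version = str(recipe.get("dynamo", {}).get("version", ""))
    if version and version not in KNOWN_DYNAMO_RELEASES:
        return [f"dynamo version {version!r} is not a known PyPI release"]
    return []

errors = check_dynamo_version(
    {"name": "gb200-fp4-1k8k-low-latency", "dynamo": {"version": "0.8.1"}}
)
print(errors)  # → ["dynamo version '0.8.1' is not a known PyPI release"]
```

Run against all recipe files in CI, this would have flagged every copy of the erroneous `0.8.1` pin at once.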
```diff
   PYTHONUNBUFFERED: "1"
   DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-  SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
+  SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"
```
Likely typo in cache directory path.
The path /configsdg-0.5.8_cu13 appears to be missing a slash between configs and dg. Other files in this PR (mid-curve.yaml, max-tpt.yaml) use /configs/dg-0.5.8_cu13.
Proposed fix:

```diff
-  SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"
+  SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/low-latency.yaml` at line 29, The SGLANG_DG_CACHE_DIR
value contains a typo: change the path in the SGLANG_DG_CACHE_DIR entry from
"/configsdg-0.5.8_cu13" to "/configs/dg-0.5.8_cu13" so it matches the other
recipes (e.g., mid-curve.yaml, max-tpt.yaml) and restores the missing slash
between "configs" and "dg".
```diff
   PYTHONUNBUFFERED: "1"
   DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-  SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
+  SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"
```
Same cache directory typo in decode_environment.
This path should also be /configs/dg-0.5.8_cu13 for consistency with other recipe files.
Proposed fix:

```diff
-  SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"
+  SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
```

📝 Committable suggestion:

```yaml
SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/low-latency.yaml` at line 46, The SGLANG_DG_CACHE_DIR
value in the decode_environment entry currently has a typo
("/configsdg-0.5.8_cu13"); update the environment variable SGLANG_DG_CACHE_DIR
to use the correct path "/configs/dg-0.5.8_cu13" so it matches other recipe
files and consistent cache directory naming.
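This typo is also easy to catch mechanically: a small lint pass over an already-parsed recipe can flag any `SGLANG_DG_CACHE_DIR` value that lost the slash after `configs`. A sketch; the helper name and the assumption that all DG caches live under `/configs/` are illustrative, not part of the repo:

```python
# Hypothetical lint pass: walk an already-parsed recipe document and
# flag SGLANG_DG_CACHE_DIR values that do not start with "/configs/".
# The "/configs/" prefix is an assumption based on the other recipes.
def find_bad_cache_dirs(node, found=None):
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "SGLANG_DG_CACHE_DIR" and not str(value).startswith("/configs/"):
                found.append(str(value))
            find_bad_cache_dirs(value, found)
    elif isinstance(node, list):
        for item in node:
            find_bad_cache_dirs(item, found)
    return found

recipe = {
    "backend": {
        "decode_environment": {"SGLANG_DG_CACHE_DIR": "/configsdg-0.5.8_cu13"}
    }
}
print(find_bad_cache_dirs(recipe))  # → ['/configsdg-0.5.8_cu13']
```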
```diff
@@ -0,0 +1,182 @@
+name: "gb200-fp8-1k1k-max-tpt"
```
Configuration name does not match the file path.
The file is located at recipes/gb200-fp8/1k8k/max_tpt.yaml but the name is set to gb200-fp8-1k1k-max-tpt. The benchmark section confirms this is a 1k8k configuration (isl: 1024, osl: 8192), so the name should be corrected.
Proposed fix:

```diff
-name: "gb200-fp8-1k1k-max-tpt"
+name: "gb200-fp8-1k8k-max-tpt"
```

📝 Committable suggestion:

```yaml
name: "gb200-fp8-1k8k-max-tpt"
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml` at line 1, The configuration 'name'
field is incorrect (it reads "gb200-fp8-1k1k-max-tpt") and must match the
benchmark settings (isl: 1024, osl: 8192); update the name value to
"gb200-fp8-1k8k-max-tpt" in this YAML (modify the name key at the top of the
file) so the config name reflects the 1k8k setup.
```yaml
frontend:
  type: dynamo
  enable_multiple_frontends: true  # Enable nginx + multiple routers
  num_additional_frontends: 9  # Additional routers (total = 1 + t
```
Truncated comment.
The comment appears to be cut off mid-sentence: "# Additional routers (total = 1 + t". Please complete the comment.
Proposed fix:

```diff
-  num_additional_frontends: 9  # Additional routers (total = 1 + t
+  num_additional_frontends: 9  # Additional routers (total = 1 + this value = 10)
```

📝 Committable suggestion:

```yaml
num_additional_frontends: 9  # Additional routers (total = 1 + this value = 10)
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml` at line 12, The inline comment for the
YAML key num_additional_frontends is truncated; update the comment associated
with num_additional_frontends to a complete sentence such as "# Additional
routers (total = 1 + num_additional_frontends)" or "# Additional routers (total
routers = 1 + num_additional_frontends)" so it clearly explains that the total
router count equals the base 1 plus the value of num_additional_frontends.
```yaml
sglang_config:
  prefill:
    # Model configuration
    served-model-name: "deepseek-ai/DeepSeek-R1"
    skip-tokenizer-init: true
    trust-remote-code: true

    # Parallelism
    tp-size: 8
    dp-size: 8
    ep-size: 8
    enable-dp-attention: true

    # KV cache and attention
    attention-backend: "trtllm_mla"
    kv-cache-dtype: "fp8_e4m3"

    # Radix cache disabled
    disable-radix-cache: true

    # Other flags
    stream-interval: 50
    max-running-requests: 30000
    context-length: 10000
    watchdog-timeout: 1000000
    disable-shared-experts-fusion: true
    eplb-algorithm: "deepseek"
    disaggregation-bootstrap-port: 30001

    # Prefill-specific mode
    disaggregation-mode: "prefill"

    # Memory and token limits
    mem-fraction-static: 0.75
    max-total-tokens: 524288
    chunked-prefill-size: 131072

    # Request handling
    load-balance-method: "round_robin"

    # Performance optimizations
    disable-cuda-graph: true

    # DeepEP configuration
    moe-a2a-backend: "deepep"
    deepep-mode: "normal"
    ep-dispatch-algorithm: "dynamic"
    moe-dense-tp-size: 1
    enable-dp-lm-head: true
    ep-num-redundant-experts: 32
    deepep-config: "/configs/deepep_config.json"

    disaggregation-transfer-backend: nixl

  decode:
    # Model configuration
    served-model-name: "deepseek-ai/DeepSeek-R1"
    skip-tokenizer-init: true
    trust-remote-code: true

    # Parallelism
    tp-size: 32
    dp-size: 32
    ep-size: 32
    enable-dp-attention: true

    # KV cache and attention
    attention-backend: "trtllm_mla"
    kv-cache-dtype: "fp8_e4m3"

    # Radix cache disabled
    disable-radix-cache: true

    # Other flags
    stream-interval: 50
    decode-log-interval: 1000
    max-running-requests: 45000
    context-length: 10000
    watchdog-timeout: 1000000
    disable-shared-experts-fusion: true
    eplb-algorithm: "deepseek"
    disaggregation-bootstrap-port: 30001

    # Decode-specific mode
    disaggregation-mode: "decode"

    # Memory and token limits
    mem-fraction-static: 0.82
    chunked-prefill-size: 36864

    # DeepEP configuration
    moe-a2a-backend: "deepep"
    deepep-mode: "low_latency"
    ep-dispatch-algorithm: "static"
    moe-dense-tp-size: 1
    enable-dp-lm-head: true
    prefill-round-robin-balance: true
    ep-num-redundant-experts: 32
    deepep-config: "/configs/deepep_config.json"

    # CUDA graphs
    cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768]
    cuda-graph-max-bs: 768

    disaggregation-transfer-backend: nixl
```
🧩 Analysis chain
🏁 Script executed:

```shell
rg -n "fp8-gemm-backend" recipes/
```

Repository: ishandhanani/srt-slurm
Length of output: 1546

🏁 Script executed:

```shell
sed -n '70,120p' recipes/gb200-fp8/1k8k/low_latency.yaml
```

Length of output: 1666
Add missing fp8-gemm-backend configuration to both prefill and decode sections.
The fp8-gemm-backend: "flashinfer_trtllm" configuration is missing from this file. Similar GB200 FP8 recipes (e.g., gb200-fp8/1k8k/low_latency.yaml) include this setting in both prefill and decode sections.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml` around lines 70 - 174, The prefill and
decode blocks in sglang_config are missing the fp8-gemm-backend setting; add
fp8-gemm-backend: "flashinfer_trtllm" inside both the prefill and decode
sections (near other FP8/attention settings such as kv-cache-dtype and
attention-backend) so that the fp8 GEMM backend is configured for functions that
rely on fp8 GEMM behavior.
```yaml
    disaggregation-bootstrap-port: 30001
    disaggregation-transfer-backend: nixl
```
Deduplicate disaggregation-bootstrap-port to avoid YAML key collisions.
The key appears twice in both prefill and decode mappings; YAML parsers may error or last-one-wins, risking unintended config.
🧹 Proposed fix (remove the duplicate entries):

```diff
@@
-    disaggregation-bootstrap-port: 30001
     disaggregation-transfer-backend: nixl
@@
-    disaggregation-bootstrap-port: 30001
     disaggregation-transfer-backend: nixl
```

Also applies to: 166-167
🧰 Tools
🪛 YAMLlint (1.38.0)
[error] 114-114: duplication of key "disaggregation-bootstrap-port" in mapping
(key-duplicates)
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/8k1k/max_tpt.yaml` around lines 114 - 115, The YAML
contains duplicate keys named "disaggregation-bootstrap-port" within the prefill
and decode mappings causing potential collisions; remove the redundant
"disaggregation-bootstrap-port" entries so each mapping has a single port key,
keeping the intended value only once (check the prefill and decode mappings
where "disaggregation-bootstrap-port" appears twice and delete the duplicate
occurrences), and verify there are no other duplicate keys like
"disaggregation-transfer-backend" in those mappings.
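Duplicate keys like this are silently last-one-wins under PyYAML's default loaders, which is why the recipe can look valid while dropping a setting. A strict loader can surface them at load time; a sketch, assuming the recipes are parsed with PyYAML (the loader subclass is illustrative, not part of srt-slurm):

```python
import yaml

class DuplicateKeyError(ValueError):
    """Raised when a YAML mapping repeats a key."""

class StrictLoader(yaml.SafeLoader):
    """SafeLoader variant that rejects duplicate mapping keys."""
    def construct_mapping(self, node, deep=False):
        seen = set()
        for key_node, _value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in seen:
                raise DuplicateKeyError(
                    f"duplicate key {key!r} near line {key_node.start_mark.line + 1}"
                )
            seen.add(key)
        return super().construct_mapping(node, deep)

doc = """
prefill:
  disaggregation-bootstrap-port: 30001
  disaggregation-bootstrap-port: 30001
"""

try:
    yaml.load(doc, Loader=StrictLoader)
except DuplicateKeyError as exc:
    print(exc)  # reports the duplicated key and its line
```

This is essentially what the YAMLlint `key-duplicates` rule flagged above, but enforced at parse time instead of in a separate lint step.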
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp4/1k8k/max-tpt.yaml`:
- Around line 61-63: Remove the legacy env var
SGLANG_FLASHINFER_FP4_GEMM_BACKEND from decode_environment to avoid
conflicting/ambiguous configuration with the newer sglang_config key
fp4-gemm-backend; update the decode_environment block (where
SGLANG_FLASHINFER_FP4_GEMM_BACKEND is set) so that only fp4-gemm-backend in
sglang_config controls the GEMM backend, and ensure prefill_environment and
decode_environment are symmetric (either both omit the legacy env var or both
include it only if you intend to keep backward compatibility).
In `@recipes/gb200-fp8/1k1k/max-tpt.yaml`:
- Around line 30-31: Replace all occurrences of the old cache directory string
"/configs/dg-10212025" with the new "/configs/dg-0.5.8_cu13" in every recipe and
example that references it, and fix the typo "/configsdg-0.5.8_cu13" to
"/configs/dg-0.5.8_cu13"; specifically search for environment variable
assignments like SGLANG_DG_CACHE_DIR and any YAML values that contain the cache
path and update them to the correct value to ensure consistency across recipes
(including the gb300/h200/h100 variants and examples).
In `@recipes/gb200-fp8/1k8k/low_latency.yaml`:
- Around line 72-76: The benchmark requests require 1024 (isl) + 8192 (osl) =
9216 tokens but max-total-tokens is set to 8192; update the max-total-tokens
field to at least 9216 (or up to 10000 to match context-length) so caching can
accommodate the full request—modify the max-total-tokens value in the same YAML
block containing context-length, disaggregation-mode, mem-fraction-static, and
chunked-prefill-size.
- Around line 12-15: The container image tag "v0.5.8-cu130-runtime" under the
model block (container: "lmsysorg/sglang:v0.5.8-cu130-runtime") is invalid and
will cause image-pull failures; update the container value to a valid tag such
as "lmsysorg/sglang:v0.5.7-cu130-runtime" or the stable
"lmsysorg/sglang:latest-cu130-runtime" while keeping the model.path ("dsr1-fp8")
and precision ("fp8") entries unchanged.
🧹 Nitpick comments (9)
recipes/gb200-fp8/1k1k/low-latency.yaml (2)
88-89: Minor formatting and consistency issues.

- Line 88 appears to be an unnecessary blank line within the YAML mapping.
- The value `nixl` is unquoted, while other similar string values like `"flashinfer_trtllm"` are quoted.

For consistency, consider quoting `nixl` and removing the blank line.

Proposed fix:

```diff
   moe-dense-tp-size: 1
   tensor-parallel-size: 4
   data-parallel-size: 1
   expert-parallel-size: 1
-
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```
114-116: Same formatting inconsistency as prefill section.

The value `nixl` is unquoted and there's a blank line (115) within the mapping. For consistency with other string values, consider quoting `nixl` and removing the blank line.

Proposed fix:

```diff
   tensor-parallel-size: 4
   data-parallel-size: 1
   expert-parallel-size: 1
   fp8-gemm-backend: "flashinfer_trtllm"
-
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

recipes/gb200-fp8/8k1k/low-latency.yaml (1)
84-86: Minor: `nixl` value is unquoted unlike other string values.

For consistency with other string values in this file (e.g., `"flashinfer_trtllm"`), consider quoting `nixl`:

💅 Suggested style fix

```diff
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

Apply to both prefill (line 86) and decode (line 112) sections.

Also applies to: 110-112
recipes/gb200-fp4/8k1k/max-tpt.yaml (2)
107-108: Consider quoting `nixl` for consistency.

The value `nixl` on line 108 is unquoted while `"flashinfer_cutlass"` on line 107 is quoted. For consistency with other string values in this file, consider quoting it.

✏️ Suggested fix

```diff
   fp4-gemm-backend: "flashinfer_cutlass"
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```
166-167: Same quoting inconsistency as prefill section.

Apply the same fix here for consistency.

✏️ Suggested fix

```diff
   fp4-gemm-backend: "flashinfer_cutlass"
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

recipes/gb200-fp4/1k8k/mid-curve.yaml (1)
73-73: LGTM with minor note.

The new `disaggregation-transfer-backend` and `fp4-gemm-backend` settings are correctly configured for prefill. Minor style observation: `nixl` is unquoted while `"flashinfer_cutlass"` is quoted; consider consistent quoting for string values.

Also applies to: 109-109
recipes/gb200-fp4/1k1k/max-tpt.yaml (1)
6-10: Pin the nginx image tag to avoid drift.

Using the floating `nginx` tag can introduce unexpected behavior on upgrades.

♻️ Proposed change

```diff
-  nginx_container: nginx
+  nginx_container: "nginx:<pinned-version-or-digest>"
```

recipes/gb200-fp8/1k8k/mid_curve.yaml (1)
63-166: Ensure DG precompile runs mirror these runtime flags.

`SGLANG_DG_CACHE_DIR` implies DeepGEMM precompile is used; make sure any precompile job uses the same flags as this `sglang_config` (e.g., `served-model-name`, parallelism sizes, attention backend, etc.) to avoid stale/incorrect kernels.

Based on learnings: In the SGLang backend (src/srtctl/backends/sglang.py), the sglang.deep_gemm_precompile module requires all the same flags as the actual serving run (including --served-model-name, --model-path, etc.) because it needs to mirror the runtime configuration to precompile the correct kernels.
recipes/gb200-fp8/1k8k/low_latency.yaml (1)
6-11: Pin the nginx image tag for reproducibility.

`nginx` without a version tag is a floating reference that resolves to the latest image, potentially introducing unreviewed changes on redeploy. Other container images in this file (e.g., `lmsysorg/sglang:v0.5.8-cu130-runtime`) are pinned with specific versions. Note that some recipes in the codebase use `nginx-sqsh` as a pinned alternative; consider whether that or a specific nginx version tag is appropriate here.

♻️ Suggested change

```diff
-  nginx_container: nginx
+  nginx_container: "nginx:<pinned_version_or_digest>"
```
```yaml
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
SGLANG_MOE_NVFP4_DISPATCH: "1"
SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions
SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass"
```
Potential redundancy: SGLANG_FLASHINFER_FP4_GEMM_BACKEND env var vs fp4-gemm-backend config.
The decode_environment still contains SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" (line 63), while the new fp4-gemm-backend: "flashinfer_cutlass" is added in sglang_config for both prefill (line 110) and decode (line 237).
This creates:
- Potential redundancy between env var and config option
- Asymmetry: prefill_environment lacks this env var, but decode has it
If fp4-gemm-backend in sglang_config is the new preferred approach, consider removing the legacy env var from decode_environment to avoid confusion about which setting takes precedence.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp4/1k8k/max-tpt.yaml` around lines 61 - 63, Remove the legacy
env var SGLANG_FLASHINFER_FP4_GEMM_BACKEND from decode_environment to avoid
conflicting/ambiguous configuration with the newer sglang_config key
fp4-gemm-backend; update the decode_environment block (where
SGLANG_FLASHINFER_FP4_GEMM_BACKEND is set) so that only fp4-gemm-backend in
sglang_config controls the GEMM backend, and ensure prefill_environment and
decode_environment are symmetric (either both omit the legacy env var or both
include it only if you intend to keep backward compatibility).
```yaml
SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Get the full context of the YAML file
cat -n recipes/gb200-fp8/1k1k/max-tpt.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 6143

🏁 Script executed:

```shell
#!/bin/bash
# Find any remaining references to the old DG cache dir
rg -n '/configs/dg-10212025|/configs/dg-0.5.8_cu13'
```

Repository: ishandhanani/srt-slurm
Length of output: 4645

🏁 Script executed:

```shell
#!/bin/bash
# Check for precompile configurations in the repo
rg -n 'precompile|SGLANG_DG' --type yaml -A 2 -B 2
```

Repository: ishandhanani/srt-slurm
Length of output: 21853

🏁 Script executed:

```shell
#!/bin/bash
# Look for sglang backend files to understand precompile requirements
find . -path '*/srtctl/backends/sglang.py' -o -path '*/srtctl/backends/*sglang*'
```

Repository: ishandhanani/srt-slurm
Length of output: 97

🏁 Script executed:

```shell
#!/bin/bash
# Check the sglang backend file for precompile handling
head -100 src/srtctl/backends/sglang.py
```

Repository: ishandhanani/srt-slurm
Length of output: 3239

🏁 Script executed:

```shell
#!/bin/bash
# Search for precompile-related code in the sglang backend
rg -n 'precompile|deep_gemm' src/srtctl/backends/sglang.py -A 3 -B 1
```

Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:

```shell
#!/bin/bash
# Look for any precompile scripts or separate precompile configurations
find . -name '*precompile*' -o -name '*deepgemm*' 2>/dev/null | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 48
Update cache directory references across all affected recipes to ensure consistency.
Old cache directory `/configs/dg-10212025` is still referenced in multiple recipes:

- recipes/gb300-fp8/ (8k1k and 1k1k variants, mid/max/low-latency)
- recipes/h200/ and recipes/h100/ (multiple configurations)
- examples/enable-profiling.yaml

Also fix the typo in recipes/gb200-fp8/1k1k/low-latency.yaml (lines 30, 47): `/configsdg-0.5.8_cu13` should be `/configs/dg-0.5.8_cu13` (missing slash).
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/max-tpt.yaml` around lines 30 - 31, Replace all
occurrences of the old cache directory string "/configs/dg-10212025" with the
new "/configs/dg-0.5.8_cu13" in every recipe and example that references it, and
fix the typo "/configsdg-0.5.8_cu13" to "/configs/dg-0.5.8_cu13"; specifically
search for environment variable assignments like SGLANG_DG_CACHE_DIR and any
YAML values that contain the cache path and update them to the correct value to
ensure consistency across recipes (including the gb300/h200/h100 variants and
examples).
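The sweep described in this prompt can be mechanized. The sketch below is a hypothetical helper (not part of srtctl) that scans recipe text for both the stale cache dir and the missing-slash typo called out above:

```python
import re

# Known-bad cache dir spellings flagged in this review; the second pattern is
# the missing-slash typo from gb200-fp8/1k1k/low-latency.yaml.
BAD_PATTERNS = [r"/configs/dg-10212025", r"/configsdg-0\.5\.8_cu13"]

def find_bad_cache_dirs(yaml_text):
    """Return every offending cache-dir substring found in a recipe's text."""
    hits = []
    for pat in BAD_PATTERNS:
        hits.extend(re.findall(pat, yaml_text))
    return hits

print(find_bad_cache_dirs('SGLANG_DG_CACHE_DIR: "/configsdg-0.5.8_cu13"'))
# → ['/configsdg-0.5.8_cu13']
print(find_bad_cache_dirs('SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"'))
# → []
```

Running something like this over `recipes/**/*.yaml` and `examples/*.yaml` would confirm no stale references remain after the rename.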
```yaml
model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
  precision: "fp8"
```
🧩 Analysis chain
🌐 Web query:
lmsysorg/sglang v0.5.8 cu130 runtime image release notes
💡 Result:
I couldn’t find any published release notes (or even an existing tag) for lmsysorg/sglang:v0.5.8-cu130-runtime on the project’s official release channels.
What I can verify as of Feb 3, 2026:
- The latest SGLang GitHub release visible is `v0.5.7` (dated Jan 1); there is no `v0.5.8` release entry on the releases page. [1]
- The official install docs recommend the CUDA 13 runtime line via `latest-cu130-runtime` for CUDA 13 environments, but do not mention `v0.5.8-cu130-runtime`. [2]
If you meant “release notes for the CUDA 13.0 runtime image around this timeframe,” the closest matching release notes are the SGLang v0.5.7 notes on GitHub. [1]
Sources:
[1] sgl-project/sglang GitHub Releases page (shows v0.5.7 and no v0.5.8) (github.com)
[2] SGLang install docs (Docker notes; recommends latest-cu130-runtime) (docs.sglang.io)
Citations:
- 1: https://github.com/sgl-project/sglang/releases?utm_source=openai
- 2: https://docs.sglang.io/get_started/install.html?utm_source=openai
🌐 Web query:
SGLang GB200 CUDA 13.0 support compatibility
💡 Result:
- Yes: SGLang can run on NVIDIA GB200 with CUDA 13.0, but the easiest/most reliable path is to use a CUDA 13-based SGLang Docker/container build rather than trying to compile everything yourself. [1], [2]
- NVIDIA's SGLang container "Release 25.11" is explicitly CUDA 13.0.2-based and lists CUDA 13.0 compatibility plus Blackwell-era support items (e.g., B300/GB300, Blackwell RTX Pro server). [1]
- Driver requirement (CUDA 13.0.2 base): NVIDIA states this CUDA 13.0.2-based release requires NVIDIA driver 570+, with notes that some datacenter GPUs can use specific older driver branches in forward-compatibility mode. [1]
- CUDA 13.0 itself explicitly supports Blackwell B200/GB200 (and B300/GB300). [6]
- If you're installing outside Docker, SGLang's install docs note that for CUDA 13, Docker is recommended; alternatively you can install a matching `sgl_kernel` "+cu130" wheel. [2]
- SGLang platform notes also list GB200 among supported NVIDIA targets (with CUDA generally). [4]
Sources: NVIDIA SGLang container release notes (CUDA 13.0.2 + driver guidance) [1], SGLang install docs (CUDA 13 install guidance / cu130 wheels) [2], NVIDIA CUDA 13.0 blog (GB200 supported) [6], SGLang platform/support summary listing GB200 [4].
🌐 Web query:
sglang:v0.5.8-cu130-runtime docker image
💡 Result:
The official SGLang Docker images are published on Docker Hub under lmsysorg/sglang. For CUDA 13.0, the docs recommend the stable runtime tag `lmsysorg/sglang:latest-cu130-runtime` (or nightly `lmsysorg/sglang:dev-cu13`). [1]

I could not find a public listing/reference for an exact `v0.5.8-cu130-runtime` tag, but a closely related published tag is:

```shell
docker pull lmsysorg/sglang:v0.5.7-cu130-runtime
```

[2]
Update the SGLang image tag—v0.5.8 does not exist.
The tag v0.5.8-cu130-runtime is not available. The latest SGLang release is v0.5.7 (Jan 1, 2026). Use v0.5.7-cu130-runtime instead, or the recommended stable tag latest-cu130-runtime for CUDA 13.0. CUDA 13.0 and GB200 are supported, but the nonexistent version will cause image pull failures.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low_latency.yaml` around lines 12 - 15, The container
image tag "v0.5.8-cu130-runtime" under the model block (container:
"lmsysorg/sglang:v0.5.8-cu130-runtime") is invalid and will cause image-pull
failures; update the container value to a valid tag such as
"lmsysorg/sglang:v0.5.7-cu130-runtime" or the stable
"lmsysorg/sglang:latest-cu130-runtime" while keeping the model.path ("dsr1-fp8")
and precision ("fp8") entries unchanged.
```yaml
context-length: 10000
disaggregation-mode: "prefill"
mem-fraction-static: 0.95
max-total-tokens: 8192
chunked-prefill-size: 8192
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Find and examine the YAML file
find . -name "low_latency.yaml" | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 107

🏁 Script executed:

```shell
# Read the YAML file with line numbers to verify the values at lines 72-76 and 122-125
wc -l recipes/gb200-fp8/1k8k/low_latency.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 109

🏁 Script executed:

```shell
# Get the relevant sections with context
sed -n '70,80p' recipes/gb200-fp8/1k8k/low_latency.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 418

🏁 Script executed:

```shell
# Get lines 120-125 to see isl and osl values
sed -n '120,125p' recipes/gb200-fp8/1k8k/low_latency.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 160
🌐 Web query:
SGLang max-total-tokens parameter meaning
💡 Result:
In SGLang, --max-total-tokens sets the maximum number of tokens that can be stored in the server’s KV-cache “memory pool” across all in-flight requests (i.e., the total cached tokens over all sequences/requests at once). This effectively caps how much concurrent context the server can keep resident; if the pool is full, new work may be queued/rejected or scheduling becomes constrained. If you don’t set it, SGLang auto-computes it based on the static GPU memory allocation (memory usage fraction) and related heuristics; it’s mainly useful for development/debugging and manual capacity control. [1], [2]
Benchmark total tokens exceed max-total-tokens capacity.
isl (1024) + osl (8192) = 9216 tokens required per benchmark request, but max-total-tokens is set to 8192. This will constrain or fail requests attempting to cache this many tokens. Raise max-total-tokens to at least 9216, or up to 10000 to match the context-length limit.
Suggested fix

```diff
-  max-total-tokens: 8192
+  max-total-tokens: 10000
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```diff
 context-length: 10000
 disaggregation-mode: "prefill"
 mem-fraction-static: 0.95
-max-total-tokens: 8192
+max-total-tokens: 10000
 chunked-prefill-size: 8192
```
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low_latency.yaml` around lines 72 - 76, The benchmark
requests require 1024 (isl) + 8192 (osl) = 9216 tokens but max-total-tokens is
set to 8192; update the max-total-tokens field to at least 9216 (or up to 10000
to match context-length) so caching can accommodate the full request—modify the
max-total-tokens value in the same YAML block containing context-length,
disaggregation-mode, mem-fraction-static, and chunked-prefill-size.
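The budget arithmetic behind this comment is simple enough to check in isolation (values copied from the recipe under review):

```python
# Values from recipes/gb200-fp8/1k8k/low_latency.yaml as discussed above.
isl, osl = 1024, 8192          # benchmark input/output sequence lengths
context_length = 10000
max_total_tokens = 8192        # current (too-small) KV-cache budget

required = isl + osl           # tokens a single request keeps resident
print(required)                          # → 9216
print(required <= max_total_tokens)      # → False: current budget is too small
print(required <= context_length)        # → True: raising to 10000 would fit
```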
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency.yaml`:
- Around line 13-22: The container tag in the recipe (container:
"lmsysorg/sglang:v0.5.8-cu130-runtime") is unavailable; update the container
value to a published tag such as "lmsysorg/sglang:v0.5.7-cu130-runtime" or the
moving tag "latest-cu130-runtime", or if GB200/Blackwell-specific support is
required switch to a Blackwell-compatible tag (for example a "b200-cu129"
variant); keep the same keys (path: "dsr1-fp8", precision: "fp8") and only
change the container string to one of the suggested valid tags.
🧹 Nitpick comments (2)
recipes/gb200-fp8/8k1k/max_tpt.yaml (2)
6-10: Pin the nginx image tag to avoid drift.

Line 10 uses `nginx` without a tag, which implicitly tracks `latest` and can change across deployments.
27-61: Consider deduplicating shared env vars (YAML anchors) if supported.

Lines 29-61 duplicate many keys across prefill/decode; anchors reduce drift.

♻️ Example refactor (anchors)

```diff
 backend:
+  # Shared environment variables (if YAML anchors are supported by the loader)
+  common_environment: &common_env
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    MC_TE_METRIC: "true"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    PYTHONUNBUFFERED: "1"
   # Prefill-specific environment variables
   prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    MC_TE_METRIC: "true"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    PYTHONUNBUFFERED: "1"
+    <<: *common_env
   # Decode-specific environment variables
   decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-0.5.8_cu13"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512"
-    MC_TE_METRIC: "true"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    PYTHONUNBUFFERED: "1"
+    <<: *common_env
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512"
+    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
```
```diff
 path: "dsr1-fp8"
 container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
 precision: "fp8"

 resources:
   gpu_type: "gb200"
   prefill_nodes: 1
-  decode_nodes: 4
+  decode_nodes: 1
   prefill_workers: 1
-  decode_workers: 4
+  decode_workers: 1
```
🧩 Analysis chain
🌐 Web query:
lmsysorg/sglang v0.5.8 cu130 runtime Docker Hub
💡 Result:
I couldn’t find any Docker Hub tag for lmsysorg/sglang:v0.5.8-cu130-runtime via public indexing/search (it does not appear to be published/visible as a tag).
What is visible:
- CUDA 13.0 "stable" runtime tag: `lmsysorg/sglang:latest-cu130-runtime` (recommended in SGLang install docs for CUDA 13 environments). [1]
- Most recent versioned CUDA 13.0 runtime tag I can confirm exists: `lmsysorg/sglang:v0.5.7-cu130-runtime`. [2]

Pull commands:

```shell
docker pull lmsysorg/sglang:latest-cu130-runtime
# or
docker pull lmsysorg/sglang:v0.5.7-cu130-runtime
```

Sources: [1] [2]
🌐 Web query:
SGLang v0.5.8 GB200 compatibility
💡 Result:
SGLang v0.5.8 is intended to run on NVIDIA GB200/B200 (Blackwell), but GB200 compatibility mainly depends on installing a PyTorch build that supports sm_100 (Blackwell). The SGLang docs explicitly call out GB200 as a supported platform and note that pip/uv dependency resolution can accidentally install an incompatible Torch build on GB200 unless you correct it. [1], [2]
What to do for SGLang v0.5.8 on GB200
- Install SGLang 0.5.8, then force-reinstall a Blackwell-capable CUDA PyTorch wheel. The SGLang install guide's GB200 "quick fix" is: [1]

```shell
uv pip install "torch==2.9.1" "torchvision" --extra-index-url https://download.pytorch.org/whl/cu129 --force-reinstall
```

- If you prefer containers (often easiest for a new GPU arch), SGLang has published Blackwell/B200 Docker tags (e.g., `b200-cu129`, and cu128 variants) in their install guidance/issues. [3]
Why this matters
Older setups commonly failed on B200/GB200 because Torch builds without sm_100 support would trigger “not compatible” warnings and “no kernel is available” runtime errors. [4]
Sources: SGLang install docs (GB200 Torch fix) [1], SGLang docs (GB200 listed in hardware support) [2], SGLang repo issue with B200/Blackwell Docker tags [3], GitHub discussion showing the typical sm_100 Torch incompatibility failure mode [4].
🌐 Web query:
lmsysorg/sglang container tags available versions
💡 Result:
lmsysorg/sglang images are published on Docker Hub, and the tag set is fairly large (CUDA/ROCm variants, runtime vs full, nightly, etc.). Here are the tag families you’ll see plus how to list everything programmatically.
Common/important tags (examples)
- Stable "moving" tags (recommended entry points): `latest`, `latest-runtime` (smaller "runtime" variant) (docs.sglang.io)
- CUDA 13 line examples called out in docs: `dev-cu13`, `latest-cu130-runtime` (docs.sglang.io)
- Nightly tags (date + git sha style), e.g. `nightly-dev-20260202-9227d4f7` (hub.docker.com) and `nightly-dev-cu13-20260116-146b5fcc` (hub.docker.com)
- Release/versioned tags (semver-like), including ROCm variants, e.g. `v0.5.6.post1-rocm700-mi35x` (hub.docker.com), `v0.5.5.post2-rocm630-mi30x` (hub.docker.com)
- Other hardware/stack variants also exist (the Docker Hub "recent tags" list changes over time), e.g. `latest-cu130` has recently appeared (hub.docker.com)

List all available tags (recommended)

Docker Hub's UI tags page is JS-rendered, so the most reliable way is the Docker Hub API approach (paginated):

```shell
curl -s "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags?page_size=100" \
  | jq -r '.results[].name'
```

Then follow `.next` to paginate until it's null (same general pagination pattern Docker documents for Hub APIs). (stackoverflow.com)
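The follow-`.next` loop described above can be sketched as follows; the `fetch` callable and stubbed pages are illustrative stand-ins for real HTTP calls to the Hub API:

```python
def collect_tags(fetch, first_url):
    """Accumulate tag names by following `.next` links until they are null."""
    tags, url = [], first_url
    while url is not None:
        page = fetch(url)  # in practice: requests.get(url).json()
        tags.extend(result["name"] for result in page["results"])
        url = page["next"]
    return tags

# Two stubbed pages mimicking the Docker Hub tags-endpoint response shape.
PAGES = {
    "page1": {"results": [{"name": "latest-cu130-runtime"}], "next": "page2"},
    "page2": {"results": [{"name": "v0.5.7-cu130-runtime"}], "next": None},
}

print(collect_tags(PAGES.__getitem__, "page1"))
# → ['latest-cu130-runtime', 'v0.5.7-cu130-runtime']
```

Swapping the stub for a real `fetch` that GETs each URL yields the full tag list, from which the presence (or absence) of a tag like `v0.5.8-cu130-runtime` can be checked directly.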
If you tell me whether you care about CUDA vs ROCm vs Ascend/CANN (and which CUDA version), I can narrow it to the exact tag(s) you should use.
Citations:
- 1: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 2: https://docs.sglang.io/get_started/install.html?utm_source=openai
- 3: https://hub.docker.com/r/lmsysorg/sglang/?utm_source=openai
- 4: https://hub.docker.com/layers/lmsysorg/sglang/nightly-dev-cu13-20260116-146b5fcc/images/sha256-524a6b0432a5f1c51ecab656ed3c6a038cc02f7fc2ddfe5ca1e0b9d44dc98a28?utm_source=openai
- 5: https://hub.docker.com/layers/lmsysorg/sglang/v0.5.6.post1-rocm700-mi35x/images/sha256-a1c98407dd3545c9585ad6b594ded5594a0c2b0aa108edfde41cd6dcdce69724?utm_source=openai
- 6: https://hub.docker.com/layers/lmsysorg/sglang/v0.5.5.post2-rocm630-mi30x/images/sha256-659910c0c70279f25a0320f0ca8fc7411932148def3602e93ca31658da52dc44?utm_source=openai
- 7: https://hub.docker.com/r/lmsysorg/sglang
- 8: https://stackoverflow.com/questions/62795446/how-to-get-all-tags-from-docker-hub-private-repositories-as-shell-script?utm_source=openai
Use an available container tag; lmsysorg/sglang:v0.5.8-cu130-runtime does not exist on Docker Hub.
The specified tag is not published. The most recent CUDA 13.0 runtime tag available is lmsysorg/sglang:v0.5.7-cu130-runtime, or use the stable moving tag latest-cu130-runtime. If GB200 compatibility is critical, consider using Blackwell-specific tags (e.g., b200-cu129) as SGLang v0.5.8 requires PyTorch with sm_100 support for GB200 GPUs.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/low-latency.yaml` around lines 13 - 22, The container
tag in the recipe (container: "lmsysorg/sglang:v0.5.8-cu130-runtime") is
unavailable; update the container value to a published tag such as
"lmsysorg/sglang:v0.5.7-cu130-runtime" or the moving tag "latest-cu130-runtime",
or if GB200/Blackwell-specific support is required switch to a
Blackwell-compatible tag (for example a "b200-cu129" variant); keep the same
keys (path: "dsr1-fp8", precision: "fp8") and only change the container string
to one of the suggested valid tags.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k8k/low_latency.yaml`:
- Line 64: The config currently sets trust-remote-code: true while referencing
the unpinned model deepseek-ai/DeepSeek-R1; pin the model to a specific
immutable revision or confirm the repo is immutable: update the
served-model-name for DeepSeek-R1 to include a revision hash (e.g.,
served-model-name: "deepseek-ai/DeepSeek-R1@<commit-sha>") wherever DeepSeek-R1
is referenced and keep trust-remote-code: true only when you have pinned or
vetted the upstream repository (check occurrences around lines where DeepSeek-R1
is used to update both instances).
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml`:
- Around line 3-15: Update the pinned image tags for reproducibility: replace
the nginx_container value (nginx_container) with a specific nginx tag (e.g.,
nginx:1.27) instead of the implicit latest, and change the model.container value
(container: "lmsysorg/sglang:v0.5.8-cu130") to a valid SGLang tag such as
"lmsysorg/sglang:v0.5.7-cu130-runtime" or
"lmsysorg/sglang:latest-cu130-runtime"; keep dynamo.version (version: 0.8.1)
as-is but verify compatibility after changing the SGLang and nginx tags.
In `@recipes/gb200-fp8/8k1k/mid-curve.yaml`:
- Around line 1-4: The YAML's top-level name field ("gb200-8k1k-fp8-mid-tpt")
does not match the file name (mid-curve.yaml); update one to be
consistent—either rename the file to match the name value or change the name
field to reflect the file name (e.g., "gb200-8k1k-fp8-mid-curve"), ensuring the
top-level name value and the file name are identical for clarity; modify the
name field in this file (the 'name' key) or rename the file accordingly.
🧹 Nitpick comments (5)
recipes/gb200-fp8/1k1k/max-tpt.yaml (2)
115-116: Consider quoting string value for consistency.

The `disaggregation-transfer-backend: nixl` addition is appropriate for high-throughput disaggregated inference. However, for consistency with other string values in the config (e.g., `attention-backend: "trtllm_mla"`, `eplb-algorithm: "deepseek"`), consider quoting the value.

🔧 Suggested change

```diff
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```
167-168: Same quoting consistency note applies here.

For symmetry with the prefill section and other string values, consider quoting this value as well.

🔧 Suggested change

```diff
-  disaggregation-transfer-backend: nixl
+  disaggregation-transfer-backend: "nixl"
```

recipes/gb200-fp8/8k1k/low-latency.yaml (1)
61-86: Consider adding `enable-symm-mem: true` to prefill config.

The 1k1k variant includes `enable-symm-mem: true` in its prefill sglang_config (line 83), but this 8k1k variant omits it. The decode section does have it (line 104). If symmetric memory is beneficial for GB200 disaggregated inference, it should likely be enabled in prefill as well.

Suggested addition

```diff
   expert-parallel-size: 1
   fp8-gemm-backend: "flashinfer_trtllm"
+  enable-symm-mem: true
   disaggregation-bootstrap-port: 30001
   disaggregation-transfer-backend: nixl
```

recipes/gb200-fp8/1k1k/mid-curve.yaml (2)
3-10: Pin the nginx image and verify multi-frontend support in Dynamo 0.8.1.

Using `nginx` without a tag pulls a mutable "latest," which can break reproducibility. Also confirm that Dynamo 0.8.1 supports `enable_multiple_frontends`, `num_additional_frontends`, and `nginx_container` to avoid silent config drift.

🔒 Example pin (replace with your chosen version or digest)

```diff
-  nginx_container: nginx
+  nginx_container: nginx:<version-or-digest>
```
107-115: Verify nixl backend availability in the image (prefill & decode).

Setting `disaggregation-transfer-backend: nixl` will hard-fail if the container lacks nixl libraries or if the SGLang build doesn't support it. Please confirm availability in the selected image and version, and consider adding a validation guard or documented fallback.

Also applies to: 160-165
```yaml
sglang_config:
  prefill:
    served-model-name: "deepseek-ai/DeepSeek-R1"
    trust-remote-code: true
```
🧩 Analysis chain
🏁 Script executed:

```shell
cd /tmp && find . -name "low_latency.yaml" -type f 2>/dev/null | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 48

🏁 Script executed:

```shell
fd "low_latency.yaml" -type f
```

Repository: ishandhanani/srt-slurm
Length of output: 236
🌐 Web query:
Hugging Face transformers trust_remote_code security implications risk
💡 Result:
Setting trust_remote_code=True in Hugging Face Transformers tells from_pretrained() to download and execute Python code from the model repository (e.g., custom modeling_*.py, tokenization_*.py, etc.) so the model can define its own classes/logic. That means it’s effectively a supply-chain remote code execution (RCE) opt-in: if the repo is malicious (or later becomes malicious), that code can run with your process’ privileges (read files/keys, exfiltrate data, spawn processes, etc.). [1]
Key security implications / risks
- Arbitrary code execution by design: enabling the flag allows execution of repo-provided Python—so the repo author (or anyone who compromises the repo) can run code on load/import. [1]
- Mutable upstream risk: if you don’t pin a specific commit/tag (“revision”), the repo can change later and you may execute new code on a future load. Hugging Face explicitly recommends pinning a revision when using remote code. [1]
- Compounding risk with unsafe weight formats: even without remote code, pickled model artifacts (common in PyTorch `.bin`) can enable code execution when deserialized; HF warns about "dangerous arbitrary code execution attacks" from pickle and encourages safer approaches (notably `safetensors`). [2][3]
- Real-world abuse exists: security researchers have reported malicious/weaponized models on the Hub leveraging these classes of issues (notably pickle-based payloads). [4]
Practical mitigations / best practices
- Avoid
trust_remote_codeunless you truly need it; prefer architectures supported natively by Transformers. [1] - If you must use it:
- Inspect the repo’s Python files you’re about to execute (modeling/tokenizer code) before enabling the flag. [1]
- Pin an immutable
revision(commit SHA or trusted tag) to prevent surprise updates. [1] - Run loading/inference in a sandbox (container/VM), with least privilege (no secrets in env, read-only FS where possible, restricted network/egress).
- Reduce artifact/deserialization risk:
- Prefer
safetensorsand/or enforce it withuse_safetensors=True(and fail if not available). [1] - Treat pickle-based weights as untrusted executables unless the publisher is trusted/verified. [2][3]
- Prefer
Bottom line: trust_remote_code=True is equivalent to “I agree to run this repository’s Python on my machine,” so only use it for trusted, reviewed, pinned repositories, ideally in an isolated runtime. [1]
Sources:
[1] Hugging Face Transformers SECURITY policy (remote code + revision pinning guidance)
[2] Hugging Face Hub docs: Pickle scanning + warning about arbitrary code execution
[3] Hugging Face Transformers SECURITY policy (safetensors preference / use_safetensors)
[4] Ars Technica report on malicious models and pickle-based attacks on Hugging Face Hub
🌐 Web query:
deepseek-ai DeepSeek-R1 trust_remote_code requirement
💡 Result:
Yes: loading deepseek-ai/DeepSeek-R1 via Hugging Face Transformers typically requires `trust_remote_code=True`, because the repo includes custom model code that Transformers must execute (otherwise you'll get an error telling you to pass `trust_remote_code=True`). [1][2][3]

Example (Transformers):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```

Note: DeepSeek's model card also says DeepSeek-R1 is "not directly supported" by Transformers (at least as stated there), and points users to other guidance/tools; the Distill variants are intended to be used like Qwen/Llama-family models. [1]
🏁 Script executed:

```shell
git ls-files | grep -i "low_latency.yaml"
```

Repository: ishandhanani/srt-slurm
Length of output: 105

🏁 Script executed:

```shell
find . -name "*.yaml" | grep -i "gb200\|low_latency" | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 873

🏁 Script executed:

```shell
cat -n recipes/gb200-fp8/1k8k/low_latency.yaml | sed -n '55,75p'
```

Repository: ishandhanani/srt-slurm
Length of output: 836

🏁 Script executed:

```shell
cat -n recipes/gb200-fp8/1k8k/low_latency.yaml | sed -n '85,100p'
```

Repository: ishandhanani/srt-slurm
Length of output: 679

🏁 Script executed:

```shell
cat recipes/gb200-fp8/1k8k/low_latency.yaml
```

Repository: ishandhanani/srt-slurm
Length of output: 3743
Pin the DeepSeek-R1 model revision or confirm repo is immutable.
trust-remote-code: true enables arbitrary code execution from the model repository. DeepSeek-R1 requires this flag, but the current configuration (lines 64 and 93) uses an unpinned model reference (deepseek-ai/DeepSeek-R1). Without pinning a specific commit or revision, future model loads may execute different code. Either add a specific revision (e.g., served-model-name: "deepseek-ai/DeepSeek-R1@<commit-sha>") or confirm the upstream repository is vetted and immutable.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/low_latency.yaml` at line 64, The config currently
sets trust-remote-code: true while referencing the unpinned model
deepseek-ai/DeepSeek-R1; pin the model to a specific immutable revision or
confirm the repo is immutable: update the served-model-name for DeepSeek-R1 to
include a revision hash (e.g., served-model-name:
"deepseek-ai/DeepSeek-R1@<commit-sha>") wherever DeepSeek-R1 is referenced and
keep trust-remote-code: true only when you have pinned or vetted the upstream
repository (check occurrences around lines where DeepSeek-R1 is used to update
both instances).
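The `repo@sha` convention suggested above is not something Transformers parses on its own; a loader would need to split the reference into a repo id and a `revision` argument. A minimal sketch (the `@` separator and helper name are illustrative assumptions, not part of any existing config schema):

```python
def split_model_ref(ref: str):
    """Split an 'org/model@revision' reference into (repo_id, revision).

    Returns revision=None when the reference is unpinned, which a caller
    could treat as a lint failure whenever trust_remote_code is enabled.
    """
    repo_id, _, revision = ref.partition("@")
    return repo_id, (revision or None)
```

With a revision in hand, `from_pretrained(repo_id, revision=revision, trust_remote_code=True)` pins the executed model code to that commit instead of whatever the repo's default branch currently holds.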
```yaml
dynamo:
  version: 0.8.1

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9
  nginx_container: nginx

model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.8-cu130"
  precision: "fp8"
```
🧩 Analysis chain
🌐 Web query:
nginx docker image default tag latest behavior
💡 Result:
- If you don’t specify a tag, Docker uses `:latest` by default. So `docker pull nginx` and `docker run nginx` are equivalent to `nginx:latest` (you’ll see “Using default tag: latest”). [1]
- `latest` is not special in Docker; it’s just a normal tag that the image publisher chooses to maintain. It does not inherently mean “newest,” and it can change over time. [2]
- For the Docker Official Image `nginx`, Docker Hub documents that `latest` points to the “mainline” variant (i.e., the mainline build they publish as `latest`). [3]

Practical implication: if you want repeatable builds, don’t rely on `nginx`/`nginx:latest`; pin to a specific tag (or even a digest). [1]
Sources
[1] Docker Docs — docker image pull (default tag behavior) (docs.docker.com)
[2] How-To Geek — Explanation of Docker’s latest tag semantics (howtogeek.com)
[3] Docker Hub — nginx Official Image overview (tag mapping for latest) (hub-stage.docker.com)
Citations:
- 1: https://docs.docker.com/reference/cli/docker/image/pull/
- 2: https://www.howtogeek.com/devops/understanding-dockers-latest-tag/?utm_source=openai
- 3: https://hub-stage.docker.com/_/nginx
🌐 Web query:
lmsysorg sglang docker image v0.5.8-cu130 tag
💡 Result:
I can’t find any published Docker Hub tag for `lmsysorg/sglang:v0.5.8-cu130` (or `v0.5.8-cu130-runtime`) in the public Docker Hub listing/mirrors I checked. The newest CUDA 13.0 versioned tag I can confirm is:

- `lmsysorg/sglang:v0.5.7-cu130-runtime` (pull: `docker pull lmsysorg/sglang:v0.5.7-cu130-runtime`) [1]

If you just need the CUDA 13.0 stable line (not pinned to v0.5.8), SGLang’s install docs recommend:

- `lmsysorg/sglang:latest-cu130-runtime` [2]
Sources: [1] https://docker.aityp.com/image/docker.io/lmsysorg/sglang%3Av0.5.7-cu130-runtime, [2] https://docs.sglang.io/get_started/install.html
🌐 Web query:
NVIDIA Dynamo 0.8.1 release
💡 Result:
NVIDIA Dynamo v0.8.1 was released on January 23, 2026 (GitHub “Latest” release). It’s a patch release focused on Kubernetes profiler improvements and bug fixes, including: model-cache PVC mounting for profiler pods, fixes to SGLang YAML boolean flag parsing, CUDA 13 SGLang container build fixes (CuDNN install logic), and a Kubernetes worker/instance ID hashing/precision issue fix. [2]
Release artifacts for v0.8.1 include updated NGC runtime containers (vLLM, SGLang, TensorRT-LLM), a dynamo-frontend:0.8.1 image, Helm chart versions 0.8.1, and PyPI wheels (ai-dynamo==0.8.1, ai-dynamo-runtime==0.8.1, kvbm==0.8.1). [1]
Also on January 23, 2026, NVIDIA published v0.8.1.post1 (no GitHub release): a patch specifically for PyPI wheels and the TensorRT-LLM runtime container, updating TRT-LLM to v1.2.0rc6.post2 while other artifacts remain at v0.8.1. [1]
Sources: NVIDIA Dynamo release artifacts/docs [1], GitHub v0.8.1 release notes [2].
Pin and correct container image tags for reproducibility and functionality.
- nginx: `nginx_container: nginx` defaults to `latest` and is not reproducible. Pin to a specific version (e.g., `nginx:1.27` or similar).
- sglang: The tag `lmsysorg/sglang:v0.5.8-cu130` does not exist on Docker Hub. The latest CUDA 13.0 versioned tag is `v0.5.7-cu130-runtime`. Either update to that tag or use `lmsysorg/sglang:latest-cu130-runtime`.
- Dynamo 0.8.1: Valid and pinned; verify compatibility with the SGLang and Nginx versions used.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k8k/max_tpt.yaml` around lines 3 - 15, Update the pinned
image tags for reproducibility: replace the nginx_container value
(nginx_container) with a specific nginx tag (e.g., nginx:1.27) instead of the
implicit latest, and change the model.container value (container:
"lmsysorg/sglang:v0.5.8-cu130") to a valid SGLang tag such as
"lmsysorg/sglang:v0.5.7-cu130-runtime" or
"lmsysorg/sglang:latest-cu130-runtime"; keep dynamo.version (version: 0.8.1)
as-is but verify compatibility after changing the SGLang and nginx tags.
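Applying the prompt's suggestions, the pinned fragment could look like the sketch below. The `nginx:1.27` and `v0.5.7-cu130-runtime` tags are the candidates named above, not verified choices; confirm they exist and are compatible before committing:

```yaml
dynamo:
  version: 0.8.1                  # valid PyPI/NGC release, already pinned

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9
  nginx_container: "nginx:1.27"   # pinned instead of the implicit :latest

model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.7-cu130-runtime"  # confirmed tag; v0.5.8-cu130 is unpublished
  precision: "fp8"
```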
```yaml
name: "gb200-8k1k-fp8-mid-tpt"

dynamo:
  version: 0.8.1
```
File name and name field mismatch.
The file is named mid-curve.yaml but the name field is "gb200-8k1k-fp8-mid-tpt". Consider aligning these to avoid confusion during debugging and maintenance.
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/8k1k/mid-curve.yaml` around lines 1 - 4, The YAML's
top-level name field ("gb200-8k1k-fp8-mid-tpt") does not match the file name
(mid-curve.yaml); update one to be consistent—either rename the file to match
the name value or change the name field to reflect the file name (e.g.,
"gb200-8k1k-fp8-mid-curve"), ensuring the top-level name value and the file name
are identical for clarity; modify the name field in this file (the 'name' key)
or rename the file accordingly.
* Update GB200-FP8 configs
* Update GB200-FP4 configs
* Add nginx container to all GB200-FP8 configs
* Add nginx container to GB200-FP4 configs
* Cleanup configs
* Switch to use fast DG cache compile
* Merge pull request #118 from ishandhanani/grho/Jan29_a: configs for gb300-fp8-no-mtp
* Update SGL-GB200-FP4-1k8k configs to use dynamo-sglang container and specify nginx container
* Add GB200-FP8-1k8k
* Update GB200 FP8 1k8k recipes
* typo
* only build for 9.0
* go
* go
* again
* try again
* go
* Update gb200 recipes (#130)
  * Update GB200-FP8 configs
  * Update GB200-FP4 configs
  * Add nginx container to all GB200-FP8 configs
  * Add nginx container to GB200-FP4 configs
  * Cleanup configs
  * Switch to use fast DG cache compile
* fix container
* clean up old
* Add 1k1k STP and MTP disagg H100 configs (#140)
  * Add 1k1k STP and MTP disagg H100 configs
  * Update H100 FP8 configs with verified 29 Pareto-optimal points. Replace previous configs with verified Pareto-optimal configurations:
    - 1k1k MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
    - 1k1k STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
    - 8k1k MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
    - 8k1k STP: 5 configs (conc: 6, 9, 30, 154, 308)
    Standardize container to nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1
  * Update H100 configs to tensorrtllm-runtime:0.8.1.post3. Update all 29 H100 FP8 config files to use the new container:
    - nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
* updates the recipe for Dynamo-SGLang B200 submissions
* adds modified B200-fp8 recipes
* updates the recipes
* prune the concurrency
* Add B200 MTP FP4 SGLANG recipes
* Update model path cand container
  Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
* modify b200 sgl fp4 non-mtp configs (#168)
* adds conc=128 point
* adds 1p2d config
* modify job name to support multiple gh runners (#182)
* Add resolved B200 FP8 8k1k recipe variants for CI compatibility
  14 standalone recipe files resolved from the consolidated 8k1k.yaml (main branch) for use with the sa-submission-q1-2026 srtctl which does not support zip_override syntax.
  STP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
  MTP: 3 low-latency (1p3d/1p4d/1p6d) + 4 max-throughput (1p2d/1p1d/2p1d/3p1d)
  Made-with: Cursor
* Bump MTP 8k1k health check timeout from 60min to 120min
  EAGLE speculative decoding + cuda-graph-max-bs=1024 requires ~50min of CUDA graph capture alone on the decode worker. Combined with model loading, DeepGEMM JIT warmup, and FlashInfer autotune, the total init exceeds the 60min (360 attempts x 10s) health check window on cold nodes. Increase max_attempts from 360 to 720 (7200s = 120min) on all 7 MTP recipe variants to provide sufficient headroom.
  Made-with: Cursor
* Fix cuda-graph-max-bs on MTP maxtpt decode workers
  With data-parallel-size=8 and dp-attention, the scheduler distributes requests across 8 DP replicas. Each replica only sees max-running-requests/dp concurrent sequences, so cuda-graph-max-bs should be divided by dp accordingly. Previous values caused CUDA graph capture of 99 batch sizes per DP replica with EAGLE speculative decoding, taking 80+ minutes and exceeding the health check timeout. Corrected values capture only 35 batch sizes, finishing in ~1 minute with no performance regression. Validated: MTP 3P1D output throughput 15,124 tok/s matches reference 14,995 tok/s (+0.9%).
  - maxtpt_0: 128 -> 16 (max-running=128, dp=8)
  - maxtpt_1: 256 -> 32 (max-running=256, dp=8)
  - maxtpt_2: 512 -> 64 (max-running=512, dp=8)
  - maxtpt_3: 1024 -> 128 (max-running=1024, dp=8)
  Made-with: Cursor
* fix rebase

---------

Signed-off-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: Grace Ho <146482179+gracehonv@users.noreply.github.com>
Co-authored-by: Kyle Liang <kylliang@nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: nlevin-ui <nlevin@nvidia.com>
Co-authored-by: Elnifio <elnifio0519@gmail.com>
Co-authored-by: Jatin Gangani <jgangani@dc2-container-xterm-014.prd.it.nvidia.com>
Co-authored-by: yunzhoul-nv <232973175+yunzhoul-nv@users.noreply.github.com>