This repository was archived by the owner on Apr 20, 2026. It is now read-only.
Merged
20 changes: 11 additions & 9 deletions recipies/gb200-fp4/1k8k/low-latency.yaml
@@ -1,8 +1,16 @@
 name: "gb200-fp4-1p2d"
 
+dynamo:
+  version: 0.7.0
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 4
+
 model:
   path: "dsr1"
-  container: "lmsysorg/sglang:nightly-dev-cu13-20260121-1e309030"
+  container: "lmsysorg/sglang:v0.5.5.post2"
   precision: "fp4"
 
 resources:
@@ -24,8 +32,6 @@ backend:
       SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
       SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
       SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
-      #SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
-      #SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
       MC_FORCE_MNNVL: "1"
       NCCL_MNNVL_ENABLE: "1"
       NCCL_CUMEM_ENABLE: "1"
@@ -43,8 +49,6 @@ backend:
       SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
       SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
       SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
-      # SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
-      # SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
       MC_FORCE_MNNVL: "1"
       NCCL_MNNVL_ENABLE: "1"
       NCCL_CUMEM_ENABLE: "1"
@@ -64,7 +68,7 @@ backend:
       moe-runner-backend: "flashinfer_trtllm"
       stream-interval: 10
       watchdog-timeout: 1000000
-      context-length: 9200
+      context-length: 10000
       mem-fraction-static: 0.95
       max-total-tokens: 8192
       chunked-prefill-size: 8192
Comment on lines +71 to 74
⚠️ Potential issue | 🟠 Major

Align max-total-tokens with the new 10k context length.

context-length: 10000 exceeds max-total-tokens: 8192, which can truncate or reject longer contexts in prefill. Please set max-total-tokens ≥ 10000 (or reduce context-length).

🛠️ Proposed fix
-      max-total-tokens: 8192
+      max-total-tokens: 10000
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
       context-length: 10000
       mem-fraction-static: 0.95
-      max-total-tokens: 8192
+      max-total-tokens: 10000
       chunked-prefill-size: 8192
🤖 Prompt for AI Agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml` around lines 71-74: the config sets context-length: 10000 but max-total-tokens: 8192, which can truncate or reject prefill. Raise max-total-tokens to at least 10000 (or lower context-length) so the two limits align.
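A mismatch like this is easy to catch mechanically before deploying a recipe. As a sketch (the helper name and dict shape are assumptions, not part of any existing recipe tooling), a pre-deploy check over one already-parsed sglang_config section could look like:

```python
def check_token_limits(server_args: dict) -> list[str]:
    """Flag configs where context-length exceeds max-total-tokens.

    `server_args` is assumed to be a single parsed sglang_config
    section (e.g. the `prefill` or `decode` mapping) from a recipe YAML.
    """
    problems = []
    ctx = server_args.get("context-length")
    max_tot = server_args.get("max-total-tokens")
    if ctx is not None and max_tot is not None and ctx > max_tot:
        problems.append(
            f"context-length {ctx} exceeds max-total-tokens {max_tot}; "
            f"requests longer than {max_tot} tokens may be truncated or rejected"
        )
    return problems

# The low-latency prefill section from this diff trips the check:
prefill = {
    "context-length": 10000,
    "mem-fraction-static": 0.95,
    "max-total-tokens": 8192,
    "chunked-prefill-size": 8192,
}
print(check_token_limits(prefill))
```

Sections that omit max-total-tokens (like the decode block in this file) pass through unflagged, since SGLang derives the limit itself in that case.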

@@ -77,7 +81,6 @@ backend:
       data-parallel-size: 1
       tensor-parallel-size: 4
       expert-parallel-size: 1
-      disaggregation-transfer-backend: nixl
 
     decode:
       disaggregation-mode: "decode"
@@ -92,15 +95,14 @@ backend:
       disaggregation-bootstrap-port: 30001
       stream-interval: 10
       watchdog-timeout: 1000000
-      context-length: 9200
+      context-length: 10000
       mem-fraction-static: 0.95
       chunked-prefill-size: 8192
       cuda-graph-max-bs: 256
       scheduler-recv-interval: 10
       moe-dense-tp-size: 1
       tensor-parallel-size: 4
       expert-parallel-size: 1
-      disaggregation-transfer-backend: nixl
 
 benchmark:
   type: "sa-bench"
20 changes: 12 additions & 8 deletions recipies/gb200-fp4/1k8k/max-tpt.yaml
@@ -1,10 +1,16 @@
 # 4P1D, with 12 Decode Nodes. Uses single batch overlap
 
 name: "gb200-fp4-max-tpt"
 
+dynamo:
+  version: 0.7.0
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+
 model:
   path: "dsr1"
-  container: "lmsysorg/sglang:nightly-dev-cu13-20260121-1e309030"
+  container: "lmsysorg/sglang:v0.5.5.post2"
Comment on lines +3 to +13
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

sglang v0.5.5 release notes features documentation

💡 Result:

SGLang v0.5.5 — release notes (highlights / features)

From the official GitHub release “Release v0.5.5” (published Nov 6, 2025), the listed highlights are:

  • Day-0 support for Kimi-K2-Thinking and MiniMax-M2 models. [1]
  • Video + image generation (diffusion) support, linked to the LMSYS post announcing SGLang Diffusion. [1][2]
  • Blackwell kernel optimizations and MoE runner backend refactor. [1]
  • Overlap spec and prefill CUDA graph support expanded to more models. [1]
  • Gateway section: “Gateway Changes (48 commits)” and changelog range gateway-v0.2.1...gateway-v0.2.2. [1]

Documentation (official)

  • Main docs home: https://docs.sglang.io/ (also mirrored at https://docs.sglang.ai/). [3]
  • Install guide: https://docs.sglang.io/get_started/install.html. [4]
  • Diffusion models (SGLang Diffusion) docs: https://docs.sglang.io/supported_models/diffusion_models.html. [5]
  • Model Gateway / Router docs: https://docs.sglang.io/advanced_features/router.html and https://docs.sglang.io/advanced_features/sgl_model_gateway.html. [6][7]

Primary release-notes link

  • GitHub releases page (contains Release v0.5.5): https://github.com/sgl-project/sglang/releases. [1]

Sources: [1] GitHub Releases (v0.5.5) (github.com), [2] LMSYS diffusion announcement (lmsys.org), [3] Docs home (docs.sglang.io), [4] Install guide (docs.sglang.io), [5] Diffusion models docs (docs.sglang.io), [6] Model Gateway/Router docs (docs.sglang.io), [7] Model Gateway page (docs.sglang.io).


🌐 Web query:

Dynamo 0.7.0 sglang compatibility enable_multiple_frontends

💡 Result:

For NVIDIA Dynamo v0.7.0, the supported/pinned SGLang version is 0.5.3.post4 (per the Dynamo 0.7.0 support matrix). If you’re hitting an error around enable_multiple_frontends, it’s almost always a version/flag mismatch (i.e., a flag expected by one version isn’t recognized by the other). [1]

Key points:

  • Dynamo’s SGLang worker (python -m dynamo.sglang) uses SGLang’s native argument parser, so you should only pass valid SGLang server args for the SGLang version Dynamo is using. [2]
  • In the published SGLang server arguments documentation, there is no documented enable_multiple_frontends / --enable-multiple-frontends server flag, so if something is trying to pass it, it’s likely coming from a mismatched wrapper/script or a different SGLang fork/version. [3]
  • Fix: ensure you install and run matching Dynamo/SGLang versions (e.g., install ai-dynamo==0.7.0 with the sglang extra, which is intended to align dependencies), and remove/stop passing enable_multiple_frontends. [1][2]

Sources
[1] Dynamo 0.7.0 Support Matrix (SGLang build dependency version) https://docs.nvidia.com/dynamo/archive/0.7.0/reference/support-matrix.html
[2] “Running SGLang with Dynamo” (native SGLang args passthrough) https://docs.nvidia.com/dynamo/dev/backends/sglang/README.html
[3] SGLang “Server Arguments” reference https://docs.sglang.io/advanced_features/server_arguments.html


🌐 Web query:

sglang lmsysorg container v0.5.5 supported features

💡 Result:

lmsysorg/sglang:v0.5.5 (SGLang v0.5.5) supports, per the v0.5.5 release highlights/changes:

  • Day-0 model support: Kimi-K2-Thinking and MiniMax-M2. [1]
  • Diffusion (image + video generation): “SGLang Diffusion” acceleration, with OpenAI-compatible API/CLI/Python entry points and support for major open-source video/image diffusion models (e.g., Wan, Hunyuan, Qwen-Image/Qwen-Image-Edit, Flux). [1][2]
  • NVIDIA Blackwell-focused improvements: kernel optimizations plus MoE runner backend refactor. [1]
  • Broader overlap/prefill improvements: “overlap spec” and prefill CUDA graph support for more models. [1]
  • Notable items in “What’s Changed”: decoupled quantization from vLLM dependency (GGUF SRT), direct video inference, Llama 4 + TensorRT-LLM MHA enablement, and multiple MoE runner/perf updates (incl. flashinfer_trtllm backend). [1]

Sources: [1] SGLang GitHub release v0.5.5, [2] LMSYS blog “SGLang Diffusion: Accelerating Video and Image Generation”.


Remove enable_multiple_frontends flag and pin sglang container to compatible version.

The configuration has two critical issues:

  1. enable_multiple_frontends: true is not a valid SGLang server argument—it will cause the server to fail with an unrecognized flag error. This flag does not exist in SGLang's documented server arguments.

  2. Container version mismatch: Dynamo 0.7.0 officially pins SGLang to v0.5.3.post4, but the configuration specifies lmsysorg/sglang:v0.5.5.post2. This version mismatch can cause compatibility issues. Use the pinned version or verify that v0.5.5 is compatible with your Dynamo 0.7.0 deployment.

🤖 Prompt for AI Agents
In `@recipies/gb200-fp4/1k8k/max-tpt.yaml` around lines 3-13: remove the invalid SGLang server flag enable_multiple_frontends (and the related num_additional_frontends) from the frontend block so the dynamo frontend invocation does not pass unrecognized arguments, and change the model.container value from "lmsysorg/sglang:v0.5.5.post2" to the Dynamo 0.7.0-pinned version "lmsysorg/sglang:v0.5.3.post4" (or another version you have verified compatible) so the container matches dynamo.version 0.7.0.
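Taken together, the review's suggested shape for this block would be roughly as follows (a sketch only; the pinned container tag comes from the Dynamo 0.7.0 support matrix cited above and should be verified against your deployment):

```yaml
name: "gb200-fp4-max-tpt"

dynamo:
  version: 0.7.0

# No enable_multiple_frontends / num_additional_frontends here:
# they are not recognized SGLang server arguments. Multi-frontend
# traffic is routed via the SGLang Model Gateway (Router) instead.
frontend:
  type: dynamo

model:
  path: "dsr1"
  # Pinned to the SGLang build Dynamo 0.7.0 is tested against.
  container: "lmsysorg/sglang:v0.5.3.post4"
  precision: "fp4"
```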

   precision: "fp4"
 
 resources:
@@ -56,13 +62,13 @@ backend:
       SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
       SGLANG_MOE_NVFP4_DISPATCH: "1"
       SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions
+      SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass"
 
   sglang_config:
     prefill:
       # Model configuration
       served-model-name: "deepseek-ai/DeepSeek-R1"
       trust-remote-code: true
-      disaggregation-transfer-backend: nixl
 
       # KV cache and attention
       kv-cache-dtype: "fp8_e4m3"
@@ -80,7 +86,7 @@ backend:
       stream-interval: 50
       decode-log-interval: 1000
       watchdog-timeout: 1000000
-      context-length: 9200
+      context-length: 10000
       disable-shared-experts-fusion: true
       eplb-algorithm: "deepseek"
       disaggregation-bootstrap-port: 30001
@@ -112,7 +118,6 @@ backend:
       # Model configuration
       served-model-name: "deepseek-ai/DeepSeek-R1"
       trust-remote-code: true
-      disaggregation-transfer-backend: nixl
 
       # KV cache and attention
       kv-cache-dtype: "fp8_e4m3"
@@ -130,7 +135,7 @@ backend:
       stream-interval: 50
       decode-log-interval: 1000
       watchdog-timeout: 1000000
-      context-length: 9200
+      context-length: 10000
       disable-shared-experts-fusion: true
       eplb-algorithm: "deepseek"
       disaggregation-bootstrap-port: 30001
@@ -228,7 +233,6 @@ backend:
       enable-dp-lm-head: true
       prefill-round-robin-balance: true
       enable-dp-attention: true
-      fp4-gemm-backend: "flashinfer_cutlass"
 
       # Parallelism
       tp-size: 48
21 changes: 12 additions & 9 deletions recipies/gb200-fp4/1k8k/mid-curve.yaml
@@ -1,11 +1,16 @@
 # 4P1D, with 8 Decode Nodes. Does not use single batch overlap but allows us to currently drive higher
 # per gpu throughput
-name: "gb200-fp4-mid-curve"
 
+name: "gb200-fp4-max-tpt-2"
+dynamo:
+  version: 0.7.0
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 9
+
 model:
   path: "dsr1"
-  container: "lmsysorg/sglang:nightly-dev-cu13-20260121-1e309030"
+  container: "lmsysorg/sglang:v0.5.5.post2"
Comment on lines +3 to +13
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

sglang v0.5.5.post2 release notes

💡 Result:

SGLang v0.5.5.post2 was tagged/released on Nov 12, 2025 and is primarily a patch release to “include the critical fix” from PR #12915. [1]

What changed (the “critical fix”)

PR #12915 modifies FlashInfer communication fusion defaults in python/sglang/srt/layers/flashinfer_comm_fusion.py: [2]

  • Lowered max_token_num default from 16384 → 2048 in:
    • ensure_workspace_initialized(...)
    • flashinfer_allreduce_residual_rmsnorm(...) [2]
  • Removed the fallback that disabled the fused path when input_tensor.shape[0] > max_token_num (it no longer returns None, None in that case). [2]

This PR is titled “Fix spec decoding acc length for dpsk-r1-fp4 tp8”, indicating the patch targets speculative decoding accuracy/length behavior for that configuration. [2]

Sources: GitHub release tag page for v0.5.5.post2 [1] and the referenced fix PR diff [2].


🌐 Web query:

Dynamo 0.7.0 sglang compatibility

💡 Result:

For Dynamo v0.7.0, the officially tested / build dependency SGLang version is 0.5.3.post4. [1]

Practical guidance from Dynamo’s SGLang backend docs:

  • Install via ai-dynamo[sglang] (recommended; Dynamo notes it “comes bundled with a stable sglang version”). [2]
  • If you install SGLang separately for development, Dynamo’s SGLang backend docs indicate you should use SGLang >= 0.5.3.post2 (but Dynamo’s 0.7.0 support matrix specifies 0.5.3.post4 as the known-compatible build version). [2] [1]

Feature notes for the Dynamo↔SGLang integration:

  • Core features like disaggregated serving and KV-aware routing are supported. [2]
  • KV Block Manager (KVBM) is not supported for SGLang in this release (planned). [3]

Sources: Dynamo v0.7.0 support matrix; Dynamo SGLang backend docs; Dynamo v0.7.0 release notes. [1] [2] [3]


🌐 Web query:

sglang enable_multiple_frontends support version

💡 Result:

enable_multiple_frontends is not a supported SGLang server argument in any currently documented SGLang release (it does not appear in the official “Server Arguments” reference / sglang.launch_server --help surface). [1]

If what you want is “multiple frontends / multiple entrypoints / multi-tenant traffic”, SGLang’s documented solution is to run workers behind the SGLang Model Gateway (Router) rather than a single launch_server flag. [2]

Sources: [1] [2]


Update container to use sglang 0.5.3.post4 and configure multi-frontend serving correctly.

Dynamo 0.7.0 officially supports sglang 0.5.3.post4, not v0.5.5.post2. Additionally, enable_multiple_frontends and num_additional_frontends are not supported SGLang server arguments. For multi-tenant serving, use the SGLang Model Gateway (Router) instead of these flags.

🤖 Prompt for AI Agents
In `@recipies/gb200-fp4/1k8k/mid-curve.yaml` around lines 3-13: change the model.container value from "lmsysorg/sglang:v0.5.5.post2" to "lmsysorg/sglang:0.5.3.post4" and remove the unsupported frontend keys enable_multiple_frontends and num_additional_frontends; for multi-tenant or multi-frontend serving, route traffic through the SGLang Model Gateway (Router) instead of using those flags.
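The same class of mistake can be screened for before launch. As an illustrative sketch (the flagged key names are taken from this review, not from any official SGLang schema), a recipe linter could reject frontend keys that the server is not known to accept:

```python
# Keys this review identifies as not being valid SGLang server arguments.
# The set is illustrative, not an official schema.
UNSUPPORTED_FRONTEND_KEYS = {"enable_multiple_frontends", "num_additional_frontends"}

def find_unsupported_frontend_keys(recipe: dict) -> list[str]:
    """Return frontend keys in a parsed recipe that this linter flags."""
    frontend = recipe.get("frontend", {})
    return sorted(k for k in frontend if k in UNSUPPORTED_FRONTEND_KEYS)

recipe = {
    "frontend": {
        "type": "dynamo",
        "enable_multiple_frontends": True,
        "num_additional_frontends": 9,
    }
}
print(find_unsupported_frontend_keys(recipe))
# → ['enable_multiple_frontends', 'num_additional_frontends']
```

Running such a check in CI against each recipe YAML would have caught the flags in all three files before they reached a cluster.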

   precision: "fp4"
 
 resources:
@@ -57,6 +62,7 @@ backend:
       SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
       SGLANG_MOE_NVFP4_DISPATCH: "1"
       SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions
+      SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass"
 
   sglang_config:
     prefill:
@@ -67,7 +73,6 @@ backend:
       # KV cache and attention
       kv-cache-dtype: "fp8_e4m3"
       attention-backend: "trtllm_mla"
-      disaggregation-transfer-backend: nixl
 
       # Quantization
       quantization: "modelopt_fp4"
@@ -81,7 +86,7 @@ backend:
      stream-interval: 50
       decode-log-interval: 1000
       watchdog-timeout: 1000000
-      context-length: 9200
+      context-length: 10000
       disable-shared-experts-fusion: true
       eplb-algorithm: "deepseek"
       disaggregation-bootstrap-port: 30001
@@ -117,7 +122,6 @@ backend:
       # KV cache and attention
       kv-cache-dtype: "fp8_e4m3"
       attention-backend: "trtllm_mla"
-      disaggregation-transfer-backend: nixl
 
       # Quantization
       quantization: "modelopt_fp4"
@@ -131,7 +135,7 @@ backend:
       stream-interval: 50
       decode-log-interval: 1000
       watchdog-timeout: 1000000
-      context-length: 9200
+      context-length: 10000
       disable-shared-experts-fusion: true
       eplb-algorithm: "deepseek"
       disaggregation-bootstrap-port: 30001
@@ -228,7 +232,6 @@ backend:
       enable-dp-lm-head: true
       prefill-round-robin-balance: true
       enable-dp-attention: true
-      fp4-gemm-backend: "flashinfer_cutlass"
 
       # Parallelism
       tp-size: 32