13 changes: 10 additions & 3 deletions recipes/gb200-fp8/1k1k/low-latency-mtp.yaml
@@ -1,8 +1,11 @@
name: "gb200-fp8-1p-2d-low-latency-mtp"

+frontend:
+nginx_container: nginx

model:
path: "dsfp8"
container: "0.5.8"
container: "lmsysorg/sglang:v0.5.8-cu130"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

SGLang v0.5.8 Docker image tags lmsysorg CUDA 13

💡 Result:

I couldn’t find any published lmsysorg/sglang Docker tag for v0.5.8 that targets CUDA 13 (e.g., no evidence of a v0.5.8-cu13* / v0.5.8-cu130* tag in the public listings I could access).

What is documented/visible for CUDA 13 with lmsysorg/sglang:

  • Nightly / dev (CUDA 13): lmsysorg/sglang:dev-cu13 (docs.sglang.io)
  • Stable (CUDA 13.0 runtime line): lmsysorg/sglang:latest-cu130-runtime (docs.sglang.io)
  • Versioned CUDA 13.0 runtime tag I can confirm exists (latest I saw): lmsysorg/sglang:v0.5.7-cu130-runtime (docker.aityp.com)

If you want, tell me whether you need build image vs -runtime, and I can try to enumerate any additional *-cu13 / *-cu130-* tags that are currently published.


🌐 Web query:

Docker Hub lmsysorg/sglang v0.5.8 available tags cu130 runtime

💡 Result:

I couldn’t find any evidence that a v0.5.8-*-cu130-runtime tag exists for lmsysorg/sglang on Docker Hub (no search hits for v0.5.8-cu130-runtime, and the Docker Hub repo page doesn’t surface any v0.5.8 tags). [1]

What is referenced/available for CUDA 13.0 runtime:

  • Stable (recommended) CUDA 13 runtime tag: lmsysorg/sglang:latest-cu130-runtime (called out in the official install docs for “CUDA 13 environment”). [2]
  • Latest versioned CUDA 13 runtime tag I could find: lmsysorg/sglang:v0.5.7-cu130-runtime (listed by a Docker Hub mirror page that tracks that exact tag). [3]

Sources:
[1] Docker Hub repo page for lmsysorg/sglang (hub.docker.com)
[2] SGLang install docs mentioning latest-cu130-runtime (docs.sglang.io)
[3] Mirror page showing v0.5.7-cu130-runtime exists (docker.aityp.com)


🌐 Web query:

SGLang CUDA 13 Docker image runtime variant "-runtime" suffix

💡 Result:

In SGLang’s official Docker Hub images (lmsysorg/sglang), the -runtime suffix means the production “runtime” variant of an image tag (e.g., latest-runtime) that is smaller (~40%) because it excludes build tools and development dependencies. [1]

For CUDA 13, SGLang recommends using either: [1]

  • Nightly/dev CUDA 13 image: lmsysorg/sglang:dev-cu13
  • Stable CUDA 13 runtime image: lmsysorg/sglang:latest-cu130-runtime (note cu130 + -runtime)

Related naming in NVIDIA’s Dynamo artifacts also uses a “runtime” image plus a CUDA 13 tag (e.g., sglang-runtime:<version>-cuda13). [2]


Update the SGLang Docker image tag for CUDA 13 with the -runtime suffix.

The tag lmsysorg/sglang:v0.5.8-cu130 does not exist in published repositories. For CUDA 13, SGLang images require the -runtime suffix (e.g., lmsysorg/sglang:v0.5.8-cu130-runtime). The latest confirmed versioned CUDA 13 runtime tag is lmsysorg/sglang:v0.5.7-cu130-runtime. Verify whether v0.5.8 with CUDA 13 runtime has been released; if not, use the v0.5.7 tag or check the official SGLang documentation for the recommended CUDA 13 image.

🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml` at line 5, the Docker image tag
for SGLang is invalid: replace the container value
"lmsysorg/sglang:v0.5.8-cu130" with the CUDA13 runtime-suffixed tag (e.g.,
"lmsysorg/sglang:v0.5.8-cu130-runtime"); if v0.5.8-cu130-runtime is not
published, use the confirmed available tag
"lmsysorg/sglang:v0.5.7-cu130-runtime" instead and verify against the official
SGLang image tags; update the container field accordingly.
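One way to verify from a shell before merging (a sketch assuming the standard Docker CLI and Docker Hub's public v2 tags API are available; the tag names are taken from the queries above):

```bash
# Ask the registry for the tag's manifest; the command fails if the tag is not published.
docker manifest inspect lmsysorg/sglang:v0.5.8-cu130-runtime >/dev/null 2>&1 \
  && echo "v0.5.8-cu130-runtime is published" \
  || echo "not found; fall back to v0.5.7-cu130-runtime or latest-cu130-runtime"

# List the currently published cu130 tags (requires curl and jq).
curl -s "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags?page_size=100&name=cu130" \
  | jq -r '.results[].name'
```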

precision: "fp8"

resources:
@@ -18,7 +21,6 @@ backend:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
PYTHONUNBUFFERED: "1"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "1"
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
@@ -32,12 +34,13 @@ backend:
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_ENABLE_SPEC_V2: "1"
+SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH: "1"
+SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"

decode_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
PYTHONUNBUFFERED: "1"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "1"
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
@@ -53,6 +56,8 @@ backend:
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_ENABLE_SPEC_V2: "1"
+SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH: "1"
+SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"

sglang_config:
prefill:
@@ -81,6 +86,7 @@ backend:
tensor-parallel-size: 4
data-parallel-size: 1
expert-parallel-size: 1
+disaggregation-transfer-backend: "nixl"
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
@@ -110,6 +116,7 @@ backend:
tensor-parallel-size: 8
data-parallel-size: 1
expert-parallel-size: 1
+disaggregation-transfer-backend: "nixl"
speculative-algorithm: "EAGLE"
speculative-num-steps: 2
speculative-eagle-topk: 1
60 changes: 13 additions & 47 deletions recipes/gb200-fp8/1k1k/max-tpt-2p1d-mtp.yaml
@@ -1,10 +1,13 @@
# GB200 FP8 Max Throughput Configuration

name: "gb200-fp8-max-tpt-mtp"
name: "gb200-fp8-max-tpt-2p1d-mtp"

+frontend:
+nginx_container: nginx

model:
path: "dsfp8"
container: "0.5.8"
container: "lmsysorg/sglang:v0.5.8-cu130"
precision: "fp8"

resources:
@@ -19,7 +22,6 @@ backend:
# Prefill-specific environment variables
prefill_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
SGLANG_DG_CACHE_DIR: "/configs/dg-01232026-{node_id}"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
MC_TE_METRIC: "true"
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
@@ -36,11 +38,11 @@ backend:
SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH: "1"
SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO: "1"
FLASHINFER_WORKSPACE_BASE: "/configs/"
+SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"

# Decode-specific environment variables
decode_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-SGLANG_DG_CACHE_DIR: "/configs/dg-01232026-{node_id}"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
MC_TE_METRIC: "true"
@@ -60,6 +62,7 @@ backend:
SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH: "1"
SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO: "1"
FLASHINFER_WORKSPACE_BASE: "/configs/"
+SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"

sglang_config:
prefill:
@@ -92,7 +95,7 @@ backend:

# Prefill-specific mode
disaggregation-mode: "prefill"
-#disaggregation-transfer-backend: "nixl"
+disaggregation-transfer-backend: "nixl"

# Memory and token limits
mem-fraction-static: 0.75
@@ -153,7 +156,7 @@ backend:

# Decode-specific mode
disaggregation-mode: "decode"
-#disaggregation-transfer-backend: "nixl"
+disaggregation-transfer-backend: "nixl"

# Memory and token limits
mem-fraction-static: 0.75
@@ -167,54 +170,17 @@ backend:
moe-dense-tp-size: 1
enable-dp-lm-head: true
prefill-round-robin-balance: true
ep-num-redundant-experts: 32
deepep-config: "/configs/deepep_config.json"

# CUDA graphs
-cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] #, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768]
-cuda-graph-max-bs: 256
+cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384]
+cuda-graph-max-bs: 384

# MTP
speculative-algorithm: "EAGLE"
-speculative-num-steps: 2
+speculative-num-steps: 1
speculative-eagle-topk: 1
-speculative-num-draft-tokens: 3
+speculative-num-draft-tokens: 2

benchmark:
type: "sa-bench"
186 changes: 186 additions & 0 deletions recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml
@@ -0,0 +1,186 @@
# GB200 FP8 Max Throughput Configuration

name: "gb200-fp8-mid-curve-3p1d-mtp"
Comment on lines +1 to +3

⚠️ Potential issue | 🟡 Minor

Inconsistent header comment.

The comment on Line 1 says "Max Throughput Configuration" but the name and filename indicate this is a "mid-curve" configuration. This appears to be a copy-paste artifact.

Proposed fix
-# GB200 FP8 Max Throughput Configuration
+# GB200 FP8 Mid-Curve 3P1D MTP Configuration
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-# GB200 FP8 Max Throughput Configuration
+# GB200 FP8 Mid-Curve 3P1D MTP Configuration
name: "gb200-fp8-mid-curve-3p1d-mtp"
🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml` around lines 1-3, the top
comment incorrectly reads "Max Throughput Configuration"; update the header
comment to match the actual recipe type by changing it to indicate "Mid-Curve
Configuration" (or similar) so it aligns with the name
"gb200-fp8-mid-curve-3p1d-mtp" and the filename; ensure the descriptive comment
at the top reflects mid-curve rather than max-throughput for clarity.


frontend:
nginx_container: nginx

model:
path: "dsfp8"
container: "lmsysorg/sglang:v0.5.8-cu130"
precision: "fp8"

resources:
gpu_type: "gb200"
prefill_nodes: 6
prefill_workers: 3
decode_nodes: 12
decode_workers: 1
gpus_per_node: 4

backend:

# Prefill-specific environment variables
prefill_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
MC_TE_METRIC: "true"
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
PYTHONUNBUFFERED: "1"
SGLANG_ENABLE_SPEC_V2: "1"
SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH: "1"
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"

# Decode-specific environment variables
decode_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "768"
MC_TE_METRIC: "true"
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
PYTHONUNBUFFERED: "1"
SGLANG_ENABLE_SPEC_V2: "1"
SGLANG_NCCL_ALL_GATHER_IN_OVERLAP_SCHEDULER_SYNC_BATCH: "1"
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"

sglang_config:
prefill:
# Model configuration
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true

# Parallelism
tp-size: 8
dp-size: 8
ep-size: 8
enable-dp-attention: true

# KV cache and attention
attention-backend: "trtllm_mla"
kv-cache-dtype: "fp8_e4m3"

# Radix cache disabled
disable-radix-cache: true

# Other flags
stream-interval: 50
max-running-requests: 30000
context-length: 2200
watchdog-timeout: 1000000
disable-shared-experts-fusion: true
eplb-algorithm: "deepseek"
disaggregation-bootstrap-port: 30001

# Prefill-specific mode
disaggregation-mode: "prefill"
disaggregation-transfer-backend: "nixl"

# Memory and token limits
mem-fraction-static: 0.75
max-total-tokens: 524288
chunked-prefill-size: 131072

# Request handling
load-balance-method: "round_robin"

# Performance optimizations
disable-cuda-graph: true

# DeepEP configuration
moe-a2a-backend: "deepep"
deepep-mode: "normal"
ep-dispatch-algorithm: "dynamic"
moe-dense-tp-size: 1
enable-dp-lm-head: true
ep-num-redundant-experts: 32
deepep-config: "/configs/deepep_config.json"

# MTP
speculative-algorithm: "EAGLE"
speculative-num-steps: 1
speculative-eagle-topk: 1
speculative-num-draft-tokens: 2

decode:
# Model configuration
served-model-name: "deepseek-ai/DeepSeek-R1"
skip-tokenizer-init: true
trust-remote-code: true

# Parallelism
tp-size: 48
dp-size: 48
ep-size: 48
enable-dp-attention: true

# KV cache and attention
attention-backend: "trtllm_mla"
kv-cache-dtype: "fp8_e4m3"

# Radix cache disabled
disable-radix-cache: true

# Other flags
stream-interval: 50
decode-log-interval: 1000
max-running-requests: 45000
context-length: 2200
watchdog-timeout: 1000000
disable-shared-experts-fusion: true
eplb-algorithm: "deepseek"
disaggregation-bootstrap-port: 30001

# Decode-specific mode
disaggregation-mode: "decode"
disaggregation-transfer-backend: "nixl"

# Memory and token limits
mem-fraction-static: 0.75
chunked-prefill-size: 36864

Comment on lines +156 to +159

⚠️ Potential issue | 🟡 Minor

Missing max-total-tokens in decode section.

The prefill section specifies max-total-tokens: 524288 (line 96), but the decode section lacks this parameter. The comparable max-tpt-2p1d-mtp.yaml has max-total-tokens: 1703116 for decode. If this omission is intentional (relying on a default), consider adding a comment; otherwise, add the appropriate value.

🤖 Prompt for AI Agents
In `@recipes/gb200-fp8/1k1k/mid-curve-3p1d-mtp.yaml` around lines 153-156, the
decode section is missing the max-total-tokens parameter; add a line under the
decode block setting max-total-tokens: 524288 (to match the prefill section's
max-total-tokens) so the decode block explicitly defines the same total-token
limit (compare with the prefill max-total-tokens and the other recipe
max-tpt-2p1d-mtp.yaml if you intend a different value).
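For reference, a minimal sketch of the suggested addition (the 524288 value is an assumption mirroring the prefill block above; substitute the 1703116 used in max-tpt-2p1d-mtp.yaml if decode should keep its larger limit):

```yaml
decode:
  # Memory and token limits
  mem-fraction-static: 0.75
  max-total-tokens: 524288  # assumed value, mirrored from prefill; verify against decode KV-cache capacity
  chunked-prefill-size: 36864
```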

# DeepEP configuration
moe-a2a-backend: "deepep"
deepep-mode: "low_latency"
ep-dispatch-algorithm: "static"
moe-dense-tp-size: 1
enable-dp-lm-head: true
prefill-round-robin-balance: true
ep-num-redundant-experts: 32
deepep-config: "/configs/deepep_config.json"

# CUDA graphs
cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384]
cuda-graph-max-bs: 384

# MTP
speculative-algorithm: "EAGLE"
speculative-num-steps: 1
speculative-eagle-topk: 1
speculative-num-draft-tokens: 2

benchmark:
type: "sa-bench"
isl: 1024
osl: 1024
concurrencies: "1024x2048x4096"
req_rate: "inf"
